Quality assessment and community detection methods for anonymized mobility data in the Italian Covid context

We discuss how to assess the reliability of partial, anonymized mobility data and compare two different methods to identify spatial communities based on movements: Greedy Modularity Clustering (GMC) and the novel Critical Variable Selection (CVS). These capture different aspects of mobility: direct population fluxes (GMC) and the probability for individuals to move between two nodes (CVS). As a test case, we consider movements of Italians before and during the SARS-Cov2 pandemic, using Facebook users’ data and publicly available information from the Italian National Institute of Statistics (Istat) to construct daily mobility networks at the interprovincial level. Using the Perron-Frobenius (PF) theorem, we show how the mean stochastic network has a stationary population density state comparable with data from Istat, and how this ceases to be the case if even a moderate amount of pruning is applied to the network. We then identify the first two national lockdowns through temporal clustering of the mobility networks, define two representative graphs for the lockdown and non-lockdown conditions and perform optimal spatial community identification on both graphs using the GMC and CVS approaches. Despite the fundamental differences in the methods, the variation of information (VI) between them assesses that they return similar partitions of the Italian provincial networks in both situations. The information provided can be used to inform policy, for example, to define an optimal scale for lockdown measures. Our approach is general and can be applied to other countries or geographical scales.

Appendix C: About Perron-Frobenius (PF) theorem for stochastic matrix In the graph identified by the mean matrix Π there is a non-zero probability to reach any node from any other node in a finite number of steps, that is, the graph is strongly connected and aperiodic.(To say a graph is aperiodic is equivalent to saying that its representative matrix is irreducible or saying that the random walk on the graph is ergodic.)Then, the transition matrix representing the graph is non-negative and irreducible.
For a general non-negative irreducible matrix, Π, the PF theorem then ensures that the highest eigenvalue λ * of Π is not degenerate.It is often called the PF eigenvalue and we name PF left eigenvector l * (and right eigenvector r * ) its associated eigenvectors.
A consequence of the theorem in this case is also that any distribution on which we apply the matrix successively, will concentrate, in the long time (long path limit), to the stationary density vector ρ * i = l * i .r* i .We explain it here in our specific case where the matrix is stochastic.The normalization of Π is such that it is stochastic, i.e. each its raws sum to 1, i.e 1 is eigenvalue of Π and is associated with the right eigenvector r * = 1 np = (1, 1, . . ., 1) T and l * : Moreover, one can also show that in this case, 1 is the maximum eigenvalue possible and the PF theorem ensure it is unique, as well as its associated eigenvectors.Therefore, for stochastic matrix, we commonly identify ρ * = l * as the stationary density vector but in general, r * i may be something else than the 1 and the PF eigenvalue different than 1.
In the following, we show the long-time limit convergence in our case.
Because the PF left eigenvector of Π is the unique invariant, it ensures the detail balance of the associated Markov process: Multiplying the left and the right by ρ * −1/2 i and ρ * −1/2 j , we obtain that Hence the matrix S defined by the elements is equal to its own transposed, and so is symmetric.Defining the transformation U = diag(ρ * 1/2 i ) we have: then S have the same eigenvalues (for S, left and right eigenvector are the same) and is symmetric, so diagonalizable in real space, so Π itself is diagonalizable.So there exists a mapping, i.e. a base of the vector space, O such that : The PF theorem ensures that for our stochastic matrix λ * = 1 > |λ i | ≥ 0, i = 1, .., N , Hence, in the long time limit (or long path limit).
Λ t ∼ diag(1, 0, 0, 0, ..., 0) The principal (or Perron-Frobenius) eigenvalue λ * = 1 will dominate and any non-trivial distribution ρ over the nodes will converge to the unique stationary density vector: We see that the importance of small link to predict the population vector.On the top panel, the full Network of the mean matrix Π is represented with all the links (self-loops excluded).The smaller links are progressively suppressed, with increasing cutoff (2, 75.10 −5 , 2, 74.10 −4 , 3.10 −4 , 3, 5.10 −4 ), and the prunned matrix is normalised to be a stochastic matrix.The Perron Fobenius first left eigenvectors of each pruning ρ * is then compute and shown on the second colomn versus ρ Istat the Istat data.The right most column is the standard deviation with this vector: we see a global increasing of these deviation from the Italian State data.For higher cuttoff, the graph becomes only weakly connected provicking some provinces to be 'sink' of the random walk.

Appendix D: Transition matrices time series
We display here more information contained in the time series of weekly-averaged daily transition matrices.First, the temporal clustering process present in Material and Methods section 3.3 can be represented in a tree (dendrogram) in which the child branches at each step represent the pairs of clusters that merge into a parent branch.We report this hierarchical clustering dendrogram in Fig. ??.The length of the branches (y-axis) corresponds to the cophenetic  The movement pattern of single provinces can be brought to collapse on two master curves with an appropriate rescaling, see Fig. ??a)-d).Specifically, this can be done by considering the normalized probability to move out of a province,shown in Fig. ??a), b) , and the normalized probability to move into a province P i in (t) ) reported in Fig. ??c), d).As can be seen from panels c), e) all provinces display a similar behavior in these two quantities, and the first two lockdowns become apparent as periods of low mobility.The Z-score, i.e. the time average of the fluctuation of P i in (t) and P i out (t) with respect to the mean over provinces, is defined and displayed in supporting information (Fig. S5).
Interestingly, we also note that some provinces show a large deviation in both quantities in correspondence of summer and winter months.To rationalize this, we look at the provinces showing peaks of mobility in those periods, and found them to correspond with those having a high touristic vocation, as for example Belluno (BL) and Trento (TN), near the Dolomites, and Sicilian provinces, see Fig. ??a), b).While at first the fact that Rome and Venice (VE) do not show these peaks might be unexpected, we recall that our data only follow the movement of Italian citizens, and that during Covid there was a strong push to take holidays outside of cities. Outgoing probability for each cluster, the colors correspond to the clusters of the non-confined case using GMC.Right Panel: Ingoing probability for each cluster the colors correspond to the clusters of the non-confined case using GMC .

Appendix F: Z-score of the probability of going in and out of provinces
To better see which provinces differ the most from the mean trend we compute the Z-score, which is defined for the time series X i (t) of province i as: with µ(t) = ⟨X i (t)⟩ being the average over provinces and σ(t) = ⟨X i (t) − µ(t)⟩ the standard deviation at time t.In Fig. ??, we show at the top a map with the Z score for the outgoing probabilities, and on the bottom one for ingoing probabilities for each province.

Appendix G: Clustering of the mean current Matrix
In Fig. ??, on the top is displayed a representation of the mean current matrix sorted by clusters and by weight, one sees that the method is satisfactory giving well-defined blocks corresponding to each community.On the bottom panel, we see that the clusters correspond, apart from very few border cases,(and Umbria is split apart) to a group of regions in Italy.In detail, for the ten clusters found, we have: • The green cluster corresponds perfectly to the "Triveneto" region (that is, Veneto, Friuli-Venezia-Giulia, and the provinces of Trento and Bolzano).
• The red one to Lombardia with the exception of Mantova plus the two provinces of Verbano/Cusio/Ossola and Novara (belonging to Piemonte) and Piacenza (belonging to Emilia-Romagna).
• The dark purple corresponds to the region of Valle d'Aosta and Piemonte (minus VB and NO) and Liguria, at the exception of Spezia.
• The yellow cluster is Toscana plus the provinces of Spezia (Liguria) and Perugia (Umbria) • The orange one corresponds to the region of Emilia-Romagna at the exception of Piacenza (PC) and adds the provinces of Pesaro/Urbino (Marche) and Mantova (Lombardia).
• The grey cluster is the regions of Marche (minus PU), Abruzzo (minus AQ) and Molise (minus IS) • The teal one matches with the regions of Lazio and Sardegna (plus TE (Umbria) and AQ (Abruzzo)) • The light purple corresponds to the region of Campania plus the province of Isernia belonging to Molise.
• The light blue cluster corresponds perfectly o the regions of Puglia and Basilicata • Finally, the blue cluster perfectly to the regions of Calabria and Sicilia.We display here the full network visualization of the two most representatives obtained for the two temporal clusters, the widths of the links are proportional to the logarithm of the transition probability.Respectively, the two data sources contained the following datasets that were of interest to us: • The Movements Between Administrative Regions dataset describes the number of Facebook users that move between two NUTS-3 administrative regions (aka province).The temporal aggregation of the dataset is of 8 hours, meaning that if a person is checked-in in region A in a certain time frame, and the same person is found to be checked-in in another region B in the subsequent time frame, then a movement between regions A and B are counted.A 24-hour the day is divided into three time frames: 00:00-08:00, 08:00-16:00 and 16:00-24:00.Two types of data are considered: the baseline, which is computed by taking the average on the same weekday for the same weekdays, and the people during the crisis, which is the actual number of people detected in the specified DateTime.Only users of the Facebook app that have the Location History option enabled are counted, and also if the aggregation yields counts under 10 units then the datum is discarded.
• The New Positive Cases By Date dataset describes the number of new positive SARS-CoV-2 cases, aggregated by date and province of detection.This dataset does not suffer from the lag between detection and publication, unlike the Dipartimento della Protezione Civile (Department for Civil Defense) data.The number of cases is the result of a window average over a week, where the final result is the day in the middle of the week (the fourth day of the week).

The choice of the stack
In order to perform the extraction, loading, and transformation of the data various paths have been explored, but our choice fell on the current technological stack.
• Python is the main scripting language and piece of software used throughout the whole pipeline.Its userfriendliness, its widespread use among both the industry and researchers, and the availability of great libraries for data science and visualization made it our natural choice for our purposes.In particular, the libraries mainly used by us are Selenium (web browser automation) and Pandas (data analysis and manipulation).
• Miller is a toolkit for data munging.It allows quick CSV manipulations and it contains several powerful commands, that can also be chained one after the other.
• Bash is used for integrating and preprocessing various data sources.It is extremely flexible and compatible with most of Unix-like systems, and for certain types of data science workloads it can quickly and efficiently get the work done.
• DuckDB is an embeddable analytical database.It is similar to SQLite, in that the database system runs within a host process, but it is optimized for analytical (OLAP) workloads.It allows for manipulations à la Pandas but also has full-query optimization and transactional storage.It is a good choice for our purposes since it allows faster queries to be made, it does not require the maintenance of a DB stack and it integrates very well with Python thanks to the DuckDB Python API.

Gathering the data
The Facebook Data for Good data can only be downloaded by using an online interface, and each transaction is size-capped (i.e.Movements Between Administrative Regions data for more than a two-weeks period would not be downloaded).In order to facilitate and speed up the sourcing of the data, a bulk download tool was developed.The tool makes use of Selenium, a Python library for browser automation, that allows automated workflows that simulate human interaction with a browser.The resulting raw data are available as zip archives containing many CSV files.The COVID-19 data by ISS/INFN is published as a single zip archive containing many CSV files, one for each aggregation, data type and province/region.Reference tables are also essential for the analysis, as they allow data integration between incoherent definitions (in particular, concerning spatial aggregation units) between different datasets.Some of them are manually compiled, and others are aggregate data extracted from the original datasets: • The provinces identifications conversion table has been created manually.It contains the correspondence between IDs for provinces in the Facebook dataset, the ISS/INFN dataset, and the "car number plate code" two letters characters).
• The locations reference table contains the latitude and longitude for each province.

Cleaning and loading the data
The raw data is then loaded into our database for further analysis.First of all, the archives are unpacked and the data is cleaned for our purposes with the use of Miller.The following operations are performed: • The data is filtered in order to get only data for Italian provinces (generally Facebook uses rectangular bounding boxes to get subsets of data).
• Minor changes in data formats are operated (such as missing date-time imputation and format correction for date-time strings).
• Null entries are discarded.
• Only columns of our interest are selected.
Then the data are piped through an SQL COPY command, that loads it in a DuckDB database.

Transforming the data
This is the last step of our data preparation pipeline.In this step, we transform the raw data contained in the database into SQL tables with SQL views, in such a way that they can be accessed easily and are expressed in a manner that is optimal for the analysis purposes of our research.
The Movements Between Administrative Regions dataset has rows aggregated and summed over with the new definitions of provinces as defined in the reference table.
Starting from this table, then multiple views are created: • total number of people moving from each origin place by date; • total number of people moving between places summed over by each day; • daily probability that a movement between places happens (aka transition matrix ); • total probability that a movement between places happens; • weekly rolling average of the daily probability of movement between places.
The COVID-19 ISS row dataset is also aggregated and summed over the province using the new definitions as defined in the reference table.??.
FIG. SI.2.PerronFrobenius theorem checked by pruning the links of the mean network of Italian mobility.We see that the importance of small link to predict the population vector.On the top panel, the full Network of the mean matrix Π is represented with all the links (self-loops excluded).The smaller links are progressively suppressed, with increasing cutoff (2, 75.10 −5 , 2, 74.10 −4 , 3.10 −4 , 3, 5.10 −4 ), and the prunned matrix is normalised to be a stochastic matrix.The Perron Fobenius first left eigenvectors of each pruning ρ * is then compute and shown on the second colomn versus ρ Istat the Istat data.The right most column is the standard deviation with this vector: we see a global increasing of these deviation from the Italian State data.For higher cuttoff, the graph becomes only weakly connected provicking some provinces to be 'sink' of the random walk.
FIG. SI.3.Hierarchical clustering dendrogram of the day-by-day transition matrices: Left: full dendrogram, Right: dendrogram cut at the level of 5 clusters with in the x-axis the number of days of each clusters.
FIG. SI.6.Map of the 2-year average Z score by province.The Z scores defined Eq. ?? are the average fluctuations by provinces of the probability of moving in (Z score IN) and out (Z score OUT) of the province.
FIG. SI.7.Italian provinces clustered using GMC by communities using the all time-averaged matrix.Left panel: Graph representation of the community clustering with colors corresponding to the different clusters, the widths of the links are proportional to the logarithm of the transition probability.At the exception of Sardinia that is in Rome cluster here, the clustering is the same than non-confined most representative matrix using GMC.Right panel: Representation of the clustered mean current matrix, for visualization the shades are in log10 of the mean probabilities of going from one province to another.
FIG. SI.8.Directed graph representation of the most representative matrices for non-confined cluster C0 (left panel) and confined one C1 (right panel) the optimal clustering in using the greedy modularity method.
Appendix E: In and outgoing probability for each spatial cluster