Scientific Reports | Article Open
Unique in the Crowd: The privacy bounds of human mobility
- Journal name: Scientific Reports
- Volume: 3
- Article number: 1376
- DOI: doi:10.1038/srep01376
We study fifteen months of human mobility data for one and a half million individuals and find that human mobility traces are highly unique. In fact, in a dataset where the location of an individual is specified hourly, and with a spatial resolution equal to that given by the carrier's antennas, four spatio-temporal points are enough to uniquely identify 95% of the individuals. We coarsen the data spatially and temporally to find a formula for the uniqueness of human mobility traces given their resolution and the available outside information. This formula shows that the uniqueness of mobility traces decays approximately as the 1/10 power of their resolution. Hence, even coarse datasets provide little anonymity. These findings represent fundamental constraints on an individual's privacy and have important implications for the design of frameworks and institutions dedicated to protecting the privacy of individuals.
Figures
Figure 1: (A) Trace of an anonymized mobile phone user during a day. The dots represent the times and locations where the user made or received a call. Every time the user has such an interaction, the closest antenna that routes the call is recorded. (B) The same user's trace as recorded in a mobility database. The Voronoi lattice, represented by the grey lines, is an approximation of the antennas' reception areas, the most precise location information available to us. The user's interaction times are here recorded with a precision of one hour. (C) The same individual's trace when we lower the resolution of our dataset through spatial and temporal aggregation. Antennas are aggregated in clusters of size two and their associated regions are merged. The user's interactions are recorded with a precision of two hours. Such spatial and temporal aggregation renders the 8:32 am and 9:15 am interactions indistinguishable.
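The aggregation described in panel (C) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `coarsen`, the antenna-to-cluster mapping `cluster_of`, the antenna ids, and the choice to bin time by integer division of the hour by the window width `h` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, float]  # (antenna id, time in hours since midnight); hypothetical encoding


def coarsen(trace: List[Point], cluster_of: Dict[int, int], h: int) -> List[Tuple[int, int]]:
    """Map each (antenna, time) point to (cluster id, index of the h-hour time bin).

    Points that fall in the same cluster and the same time bin collapse into one.
    """
    return sorted({(cluster_of[antenna], int(t // h)) for antenna, t in trace})


# Hypothetical trace: interactions at 8:32 am, 9:15 am and 3:00 pm.
trace = [(1, 8 + 32 / 60), (2, 9 + 15 / 60), (3, 15.0)]

# Full resolution: every antenna is its own cluster, one-hour bins.
hourly = coarsen(trace, {1: 1, 2: 2, 3: 3}, h=1)

# Lower resolution: antennas 1 and 2 merged into cluster 0 (assumed), two-hour bins.
two_hourly = coarsen(trace, {1: 0, 2: 0, 3: 1}, h=2)

print(len(hourly), len(two_hourly))
```

Under these assumptions the 8:32 am and 9:15 am points land in the same cluster and the same two-hour bin, so the coarsened trace holds fewer distinct points than the hourly one.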
Figure 2: (A) Ip = 2 means that the information available to the attacker consists of two 7am-8am spatio-temporal points (I and II). In this case, the target was in zone I between 9am and 10am and in zone II between 12pm and 1pm. In this example, the traces of two anonymized users (red and green) are compatible with the constraints defined by Ip = 2. The subset S(Ip = 2) contains more than one trace and is therefore not unique. However, the green trace would be uniquely characterized if a third point, zone III between 3pm and 4pm, were added (Ip = 3). (B) The uniqueness of traces with respect to the number p of given spatio-temporal points (Ip). The green bars represent the fraction of unique traces, i.e. |S(Ip)| = 1. The blue bars represent the fraction of traces with |S(Ip)| ≤ 2. Therefore, knowing as few as four spatio-temporal points taken at random (Ip = 4) is enough to uniquely characterize 95% of the traces amongst 1.5 million users. (C) Box-plot of the minimum number of spatio-temporal points needed to uniquely characterize every trace in the non-aggregated database. At most eleven points are enough to uniquely characterize all considered traces.
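The compatibility test in panel (A) and the fraction plotted in panel (B) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: traces are encoded as sets of (zone, hour) pairs, the names `compatible_users` and `fraction_unique` are hypothetical, and the toy data (including the red user's zone IV point) is invented for the example.

```python
import random
from typing import Dict, Set, Tuple

Point = Tuple[str, int]  # (zone label, starting hour of the one-hour slot); hypothetical encoding


def compatible_users(traces: Dict[str, Set[Point]], known: Set[Point]) -> Set[str]:
    """Return S(Ip): the users whose trace contains every known point."""
    return {user for user, trace in traces.items() if known <= trace}


def fraction_unique(traces: Dict[str, Set[Point]], p: int, rng: random.Random) -> float:
    """Estimate the fraction of traces pinned down by p points drawn at random from each trace."""
    users = [u for u, t in traces.items() if len(t) >= p]
    if not users:
        return 0.0
    hits = 0
    for user in users:
        known = set(rng.sample(sorted(traces[user]), p))
        hits += len(compatible_users(traces, known)) == 1
    return hits / len(users)


# Toy data: both users share the first two points, as in panel (A).
traces = {
    "red": {("I", 9), ("II", 12), ("IV", 15)},
    "green": {("I", 9), ("II", 12), ("III", 15)},
}

s2 = compatible_users(traces, {("I", 9), ("II", 12)})               # Ip = 2
s3 = compatible_users(traces, {("I", 9), ("II", 12), ("III", 15)})  # Ip = 3
print(sorted(s2), sorted(s3))
```

With two known points both toy traces remain in S(Ip), so the target is not unique; adding the third point leaves only the green trace.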
Figure 3: (A) Probability density function of the number of recorded spatio-temporal points per user during a month. (B) Probability density function of the median inter-interaction time with the service. (C) The number of antennas per region is correlated with its population (R² = 0.6426). These plots strongly emphasize the discrete character of our dataset and its similarities with datasets such as those collected by smartphone apps.
Figure 4: Uniqueness of traces [ε] when we lower the resolution of the dataset with (A) p = 4 and (D) p = 10 points. It is easier to attack a dataset that is coarse on one dimension and fine along another than a medium-grained dataset along both dimensions. Given four spatio-temporal points, more than 60% of the traces are uniquely characterized in a dataset with a temporal resolution of h = 15 hours, while less than 40% of the traces are uniquely characterized in a dataset with a temporal resolution of h = 7 hours and with clusters of v = 7 antennas. The region covered by an antenna ranges from 0.15 km² in urban areas to 15 km² in rural areas. (B–C) When lowering the temporal or the spatial resolution of the dataset, the uniqueness of traces decreases as a power function ε = α − x^β. (E) While ε decreases according to a power function, its exponent β decreases linearly with the number of points p. Accordingly, a few additional points might be all that is needed to identify an individual in a dataset with a lower resolution.
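To see how the exponent shapes the decay in panels (B–C), the caption's fitted form can be evaluated directly. The values of α and β below are hypothetical, chosen only for illustration and not taken from the paper's fits; the sketch also assumes x ≥ 1 and β > 0 in this form, so that ε falls as the resolution parameter x grows.

```python
def uniqueness(x: float, alpha: float, beta: float) -> float:
    """Evaluate the caption's fitted form, epsilon = alpha - x**beta."""
    return alpha - x ** beta


ALPHA = 1.95  # hypothetical: gives epsilon = 0.95 at x = 1
resolutions = [1, 2, 4, 8, 16]  # x: aggregation level (hours or antennas per cluster)

# Two hypothetical exponents: a larger beta and a smaller one.
curve_large_beta = [uniqueness(x, ALPHA, 0.15) for x in resolutions]
curve_small_beta = [uniqueness(x, ALPHA, 0.05) for x in resolutions]

for x, hi, lo in zip(resolutions, curve_large_beta, curve_small_beta):
    print(f"x={x:2d}  beta=0.15: {hi:.3f}  beta=0.05: {lo:.3f}")
```

Under these assumptions both curves decrease with x, and the curve with the smaller exponent loses uniqueness more slowly. This matches the caption's point that, since β decreases with p, a few additional points can offset a lower resolution.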
, β ~ −p/100. Together, these determine the uniqueness of human mobility traces given the traces' resolution and the available outside information. These results should inform future thinking in the collection, use, and protection of mobility data. Going forward, the importance of location data will only increase
