A maximum entropy approach for the modelling of car-sharing parking dynamics

The science of cities is a relatively new and interdisciplinary topic aimed at studying and characterizing the collective processes that shape the growth and dynamics of urban populations. Amongst other open problems, the forecasting of mobility trends in urban spaces is a lively research topic that aims at assisting the design and implementation of efficient transportation policies and inclusive urban planning. To this end, many Machine-Learning models have been put forward to predict mobility patterns. However, most of them are not interpretable, as they build on complex hidden representations of the system configurations, or do not allow for model inspection, thus limiting our understanding of the underlying mechanisms driving the citizens' daily routines. Here, we tackle this problem by building a fully interpretable statistical model that, incorporating only the minimum number of constraints, can predict different phenomena arising in the city. Using data on the movements of car-sharing vehicles in several Italian cities, we infer a model using the Maximum Entropy (MaxEnt) principle. The model allows for accurate spatio-temporal predictions of car-sharing vehicles' presence in different city areas and, thanks to its simple yet general formulation, can also be used to precisely perform anomaly detection (e.g., detecting strikes and bad weather conditions from car-sharing data only). We compare the forecasting capabilities of our model with those of different state-of-the-art models explicitly designed for time-series forecasting: SARIMA models and Deep Learning models. We find that MaxEnt models are highly predictive, outperforming SARIMAs and performing comparably to deep Neural Networks, with the advantages of being more interpretable, more flexible (i.e., applicable to different tasks), and computationally efficient. Our results show that statistical inference can play a fundamental role in building robust and general models describing urban systems phenomena.


INTRODUCTION
In recent years, pressing societal and environmental problems such as population growth, migrations, and climate change boosted the research on the Science of Cities and the related study of mobility. The availability of large and detailed datasets covering the mobility of individuals at different granularities contributed largely to enhance the interest of researchers in this field [1,2]. Being multi-disciplinary, Science of Cities studies embrace diverse areas of research. For example, statistical methods have been applied to city growth [3], multi-layer networks to urban resilience [4], and spatial networks to describe and characterize the structure and the evolution of phenomena arising on them [5]. On the modeling side, co-evolution models [6] and agent-based simulations [7] have been used to model stylized facts and to take policy making into account. Moreover, other frameworks, such as ranking dynamics [8], have been used to study urban environments and the universal laws arising in them.
The mobility patterns of individuals moving in an urban environment are determined by the interplay of different mechanisms, such as the daily routines of the individuals and environmental constraints, both in turn regulated by even more fundamental and interrelated phenomena, like wealth, trends, economic relations, socio-political disparities, and cultural movements [9][10][11]. Urban environments are complex systems [12], and their growth and relation with the surrounding cities are not unanimously understood by the scientific community [13]. Since commuting within and between cities may be driven by different events and related to different causes, different tools and understandings must be developed [14,15].
In this work, we study urban mobility patterns, building a model based on only a few constraints driven by data observations. The model can predict different events in the urban environment with high precision and can be generalized to other dynamical processes unfolding in urban spaces. Specifically, we propose a Statistical Inference (Maximum Entropy) approach to study and predict urban mobility patterns. The phenomena and the related observables within the city are complex: they entangle with various indicators and are difficult to predict. To build a powerful, general, and robust statistical model, in principle we need to understand which variables play a central role in the phenomenon we want to model. To do so, we need to analyze the dataset under study, identify the important dynamical properties, and then model them by solving the resulting entropy-optimization problem.
The data represent the 30-minute-binned multivariate time series of the activity of different zones inside the city.
Since urban systems are notoriously complex and the fundamental causes of the observed mobility patterns are manifold and interrelated, our methodology is novel in this field in that it builds the most general model constrained only to reproduce the observed correlations.
In section Results we present the results for the Metropolitan City of Milan: multivariate forecasting of the zones' activity and forecasting of outlier events, discussed in section Discussion. We also set the basis for the formal derivation of the model: we give an introduction to Maximum Entropy models, the formalism used in the following sections, and the optimization algorithm used. Afterward, we introduce the formalism to compute an approximated Log-Likelihood. Similar results for the cities of Rome, Florence, and Turin can be found in the Additional Information.
In section Methods we describe the data and the variables in use. To find out which variables are important to represent, we perform a historical analysis and observe a dynamics of contraction and dilation of the time-shifted correlations between different zones. With this, we identify the important effects and finally define and derive the formulae for the ME model describing our data.
We use MaxEnt inference to obtain the parameters (using a gradient-ascent algorithm) and reproduce the lag-correlations of positive-valued time series. We derive a model that is at least as predictive as more sophisticated models that take into account nonlinear correlations and have more parameters. We also use the obtained statistical model to find extreme events (outliers, such as strikes and bad-weather days). As a result, we find the dynamics of cars' presence in urban areas to be extremely rich and complex. We infer the coupling parameters between the activity profiles of different areas and use them to efficiently project the cars' locations in time.

RESULTS
In this section, after a brief introduction to the method and the data studied, we present the results. For a more in-depth description, we refer to sec. Methods.

Data
The data we use is a normalised multivariate time series representing the parking activity of the different municipalities of the city. The procedure to obtain the final form of the time series is described in sec. Methods. In Fig. 1 we show an example of activity for two zones of our dataset for the Metropolitan City of Milan.
FIG. 1: Time series activity data for the city center (a) and for one of the suburbs (b).

Maximum Entropy Principle
The principle of maximum entropy states that the probability distribution which best represents the current state of knowledge, given precisely stated prior data, is the one with the largest entropy. According to this principle, the distribution with maximal information entropy is the best choice. The principle was first formulated by E. T. Jaynes in two papers in the late fifties, where he emphasized a natural correspondence between statistical mechanics and information theory [16,17].
In particular, Jaynes offered a new and very general rationale for why the Gibbsian method of statistical mechanics works. He showed that statistical mechanics, and in particular Ensemble Theory, can be seen as a particular application of information theory; hence, there is a strict correspondence between the entropy of statistical mechanics and the Shannon information entropy.
Maximum Entropy models have unveiled interesting results throughout the years for a large variety of systems, such as flocking birds [18], proteins [19], the brain [20], and social systems [21].
We will then implement this approach to define the model of our real-world system in the following sections. A more general introduction to the maximum entropy formalism is out of scope here, and we refer to the existing literature for details [22][23][24][25].
The probability distribution with Maximum Entropy $P_{ME}$ results from the extremal condition of the so-called Generalized Entropy
$$\tilde{S}[P] = S[P] - \sum_k \theta_k \left( \langle f_k(X) \rangle_P - \langle f_k(X) \rangle_{data} \right) - \lambda \left( \int_\Omega dX\, P(X) - 1 \right), \qquad (1)$$
where $S[P] = -\int_\Omega dX\, P(X) \log P(X)$ is the Shannon Entropy of the probability distribution $P(X)$, the $f_k$ are the observables whose empirical averages we constrain, and the $\theta_k$ and $\lambda$ are Lagrange multipliers. The maximum of the Generalized Entropy is the maximum of the Entropy of the model when it is subject to the constraints. Computing the functional derivative of (1) with respect to $P(X)$ and equating it to zero results in
$$P_{ME}(X \mid \theta) = \frac{1}{Z(\theta)} \exp\left( -\sum_k \theta_k f_k(X) \right),$$
where $Z(\theta) = \int_\Omega dX \exp\left( -\sum_k \theta_k f_k(X) \right)$ is the normalization (making a parallel with statistical physics, it can be called the Partition Function); $Z(\theta)$ is written as a sum if $\Omega$ is discrete. Hence, the maximum entropy probability distribution is a Boltzmann distribution in the canonical ensemble at unit temperature, with effective Hamiltonian $H(X) = \sum_k \theta_k f_k(X)$. It must be noticed that the extremization of the Generalized Entropy is equivalent to the maximization of the experimental average of the log-likelihood:
$$\mathcal{L}(\theta) = \langle \log P_{ME}(X \mid \theta) \rangle_{data}.$$
In other words, the $\theta_k$ are chosen by imposing the experimental constraints on the Entropy or, equivalently, by maximizing the global, experimental likelihood according to a model with the constraints cited above. Focusing on this last sentence, we can say that the optimal parameters $\theta$ (called effective couplings) can be obtained through Maximum Likelihood, but only once one has assumed (by the principle of Maximum Entropy) that the most probable distribution has the form of $P_{ME}$.
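For a discrete configuration space, the resulting Boltzmann form can be evaluated directly. The following minimal sketch (a toy model, not the one used in this work) builds the MaxEnt distribution for a single constrained observable $f$:

```python
import math

# Toy MaxEnt distribution over a discrete space: P(x) = exp(-theta*f(x)) / Z.
# The states and the observable f below are illustrative choices only.

def maxent_distribution(states, f, theta):
    weights = {x: math.exp(-theta * f(x)) for x in states}
    Z = sum(weights.values())            # the partition function (a sum here)
    return {x: w / Z for x, w in weights.items()}

states = [0, 1, 2, 3]
p = maxent_distribution(states, f=lambda x: x, theta=0.7)
```

With a positive coupling, states with larger $f(x)$ get exponentially smaller probability, exactly as in a Boltzmann weight.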
Given the generative model probability distribution of configurations $P(X \mid \theta)$ and its corresponding log-partition function $\log Z(\theta)$, the estimator of $\theta$ can be found by maximizing the log-likelihood
$$\mathcal{L}(\theta) = \langle \log P(X \mid \theta) \rangle_{data} = -\langle H(X) \rangle_{data} - \log Z(\theta).$$
Having chosen the log-likelihood as our cost function, we still need to specify a procedure to maximize it with respect to the parameters.
One common choice, widely employed when training energy-based models, is Gradient Descent [25] or its variations: the optimization proceeds along the gradient direction. Once one has chosen the appropriate cost function $\mathcal{L}$, the algorithm calculates the gradient of the cost function with respect to the model parameters. The update equation is
$$\theta_k^{(n+1)} = \theta_k^{(n)} + \eta\, \frac{\partial \mathcal{L}}{\partial \theta_k},$$
where $\eta$ is the learning rate. Typically, the difficulty in this kind of problem is to evaluate $\log Z(\theta)$ and its derivatives. The reason is that the partition function is rarely an exact integral, and it can be calculated exactly only in a few cases. However, it is still possible to find ways to approximate it and compute approximated gradients.
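As an illustration of the update rule, the following toy example (not the code used in this work) fits the single coupling $h$ of a one-parameter MaxEnt model $P(x) \propto \exp(-x^2 + h x)$, for which the model average is $\langle x \rangle = h/2$, so the gradient of the average log-likelihood reduces to the difference between the empirical and model averages:

```python
# Gradient ascent on the average log-likelihood of P(x) ∝ exp(-x^2 + h*x):
# dL/dh = <x>_data - <x>_model = <x>_data - h/2, so the fixed point is
# h = 2 * mean(data).

def fit_h(data, eta=0.5, n_steps=200):
    """Infer the coupling h by gradient ascent; eta is the learning rate."""
    x_bar = sum(data) / len(data)      # empirical constraint <x>_data
    h = 0.0
    for _ in range(n_steps):
        grad = x_bar - h / 2.0         # moment-matching gradient
        h += eta * grad                # ascent step
    return h

sample = [0.9, 1.1, 1.3, 0.7, 1.0]     # toy "observations", mean = 1.0
h_hat = fit_h(sample)                  # converges to h = 2 * mean(sample)
```

The iteration converges geometrically for any learning rate $0 < \eta < 4$; when the gradient vanishes, the model moment matches the empirical one, which is exactly the MaxEnt constraint.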

Pseudo-Log-Likelihood (PLL) Maximization
Pseudo-likelihood is an alternative to the likelihood function and leads to the exact inference of model parameters in the limit of an infinite number of samples [26,27]. Let us consider the log-likelihood function $\mathcal{L}(\theta) = \langle \log P(X \mid \theta) \rangle_{data}$. In some cases we are not able to compute the partition function $Z(\theta)$, but it is possible to derive exactly the conditional probability of one component of $X$ with respect to the others, i.e. $P(X_j \mid X_{-j}, \theta)$, where $X_{-j}$ indicates the vector $X$ without the $j$-th component.
In this case, we can write an approximated likelihood called the Pseudo-log-likelihood, which takes the form
$$\mathcal{L}_{pseudo}(\theta) = \sum_j \langle \log P(X_j \mid X_{-j}, \theta) \rangle_{data}. \qquad (8)$$
The model we will introduce in this work does not have an explicit form for the partition function, but Eq. (8) and its derivatives can be calculated exactly. Thus, the Pseudo-log-likelihood is a convenient cost function for our problem.
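As a minimal illustration, consider a hypothetical two-variable model (not the model of this work) whose conditionals are exact Gaussians even when the joint partition function is not available; the pseudo-log-likelihood is then the sum of the conditional log-densities:

```python
import math

# Pseudo-log-likelihood of a toy model where each conditional is Gaussian:
# P(x_j | x_-j) = N(J * x_-j, var). The samples and coupling J are made up.

def gaussian_logpdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def pseudo_log_likelihood(samples, J, var=1.0):
    """Average over samples of sum_j log P(x_j | x_{-j})."""
    pll = 0.0
    for x1, x2 in samples:
        pll += gaussian_logpdf(x1, J * x2, var)  # condition x1 on x2
        pll += gaussian_logpdf(x2, J * x1, var)  # condition x2 on x1
    return pll / len(samples)

samples = [(1.0, 0.9), (-0.5, -0.4), (0.2, 0.1)]
pll_good = pseudo_log_likelihood(samples, J=0.9)
pll_bad = pseudo_log_likelihood(samples, J=-0.9)
```

A coupling consistent with the data yields a higher pseudo-log-likelihood than an inconsistent one, which is what makes it usable as a cost function.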

Definition of MaxEnt model
We have seen that the Maximum Entropy inference scheme requires the definition of some observables that are supposed to be relevant to the system under study. Since one aims to predict the evolution of the different $x_i(t)$, the simplest choice is to consider their correlations.
As a preliminary analysis, we study the correlation between the activity $x_i(t)$ of the most central zone within the city (highlighted in red) and all the other zones with a shift in time (i.e., we correlate $x_i(t)$ with $x_j(t-\delta)$ for all $j \neq i$ and some $\delta > 0$). The measure of correlation between two vectors $u$ and $v$ is
$$C(u, v) = \frac{\sum_t (u_t - \bar{u})(v_t - \bar{v})}{\sqrt{\sum_t (u_t - \bar{u})^2}\,\sqrt{\sum_t (v_t - \bar{v})^2}}.$$
We see that the areas with large correlations vary with $\delta$: they cluster around the central area for small values and become peripheral when $\delta \sim 31$. When $\delta \sim 48$ they start to cluster around the central area again. The opposite behavior is observed when repeating the same procedure focusing on a peripheral zone. We thus perform a historical analysis of time-shifted correlations, observing contraction and dilation with respect to different zones with a one-day periodicity. Due to the dependency of the correlations on the time shift, we have to include it in the observables used to describe the system. Hence, we chose as observables all the shifted correlations between all the couples of zones, defined as
$$C^{\delta}_{ij} = \frac{1}{T} \sum_{t} x_i(t)\, x_j(t-\delta)$$
for $i, j = 1, \dots, N$ (with $N$ the number of zones we took into consideration) and $\delta = 0, \dots, \Delta$. Another common choice is to also fix the average value of all the variables of the system. We took
$$\langle x_i \rangle = \frac{1}{T} \sum_{t} x_i(t).$$
From these, we obtain the equation for the probability:
$$P(X) = \frac{1}{Z} \exp\left( \sum_{t} \sum_{i} \left[ -a_i\, x_i(t)^2 + h_i\, x_i(t) + \sum_{\delta} \sum_{j} J^{\delta}_{ij}\, x_i(t)\, x_j(t-\delta) \right] \right),$$
where $a_i$ is the parameter associated with the $i$-th component's standard deviation, $h_i$ with its mean, and the $J^{\delta}_{ij}$ are the time-shifted interactions. Writing $v_i(t) = h_i + \sum_{\delta} \sum_{j} J^{\delta}_{ij}\, x_j(t-\delta)$, one can obtain
$$P\left(x_i(t) \mid \{x_j(t-\delta)\}\right) = \frac{1}{Z_i(t)} \exp\left( -a_i\, x_i(t)^2 + v_i(t)\, x_i(t) \right).$$
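The time-shifted correlation analysis described above can be sketched as follows, with a toy daily-periodic series standing in for the real parking data:

```python
import math

# Lag-δ Pearson correlation: correlate x_i(t) with x_j(t - delta).

def pearson(u, v):
    n = len(u)
    mu_u, mu_v = sum(u) / n, sum(v) / n
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5

def lagged_correlation(x_i, x_j, delta):
    """Correlation of x_i(t) with x_j(t - delta)."""
    return pearson(x_i[delta:], x_j[:len(x_j) - delta])

# For a signal with a one-day period (48 half-hour bins), the correlation
# with itself shifted by one full period is maximal again.
period = 48
series = [math.sin(2 * math.pi * t / period) for t in range(10 * period)]
c = lagged_correlation(series, series, period)
```

Scanning $\delta$ from 0 to $\Delta$ with such a function is what produces the contraction/dilation pattern of correlated zones described above.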
To quantify the predictive capabilities of our models, we used the Mean Absolute Error (MAE), the Mean Squared Error (MSE), and the Coefficient of Determination, also known as $R^2$ [28]. We typically find good predictive capabilities on the test set (as in fig. 4; the exact values for each predictor are reported in tab. I). As a reference, we used the average value at each time step during the day in the training set as a baseline model. For ∆ = 48 (exactly the day periodicity) and λ = 0.005, we find our model's best performance, so we will use these values.
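The evaluation metrics can be sketched as follows (toy values, not the ones reported in the tables); by construction a constant mean predictor scores $R^2 = 0$, so any useful model or baseline should score above it:

```python
# MAE, MSE and coefficient of determination R^2 for a toy prediction.

def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]   # made-up forecasts
```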
To assess the predictive power of our approach against more complex models, we compared it with standard statistical inference and Machine Learning methods. In particular: 1. a SARIMA model [29]: we performed a grid search over the parameters using the AIC [30] as metric; the best predictive model, in the standard notation of the literature, is {p = 2, d = 0, q = 1, P = 2, D = 1, Q = 2, m = 48}. Tab. II reports the grid search.
2. NN and LSTM models: each has 3 hidden layers with $N_{nodes} \in \{64, 128, 256, 512\}$ nodes. Between the layers, we applied dropout and standard non-linear activation functions (ReLU). Performing a grid search over the other hyperparameters, the layers with 128 and 256 nodes perform best.
Fig. 2 shows that our model outperforms SARIMA and performs as well as the Machine Learning ones. Since the LSTM always performs slightly better than the FeedForward network, we will use the former as a reference. Fig. 3 shows this in more detail: SARIMA fails to reproduce variations and mostly captures seasonality. Our model works as well as the NNs but with two orders of magnitude fewer parameters (109, compared with 39040 for the 128-node FeedForward configuration).
In fig. 7 we show data and results for the cities of Rome, Florence, and Turin.

Extreme events prediction
As already pointed out, prediction is a possible application of our procedure, and we compared the results with state-of-the-art forecasting techniques. Another possible application is the detection of outliers in the data. We see from Fig. 1 that our car-sharing data exhibit quite regular patterns on different days. These patterns can be perturbed by a wide variety of external events, influencing the demand for car-sharing vehicles as well as the traffic patterns in the city. If we compute the log-likelihood from equation (17) restricting it to one of the perturbed days, we would expect it to be particularly low, implying that the model has lower predictive power than usual. Hence, we can use the log-likelihood $\mathcal{L}$ from equation (17) as a score in a binary classification task. After training the model with a specific parameter choice, we consider a set of days in the test dataset, $d_1, \dots, d_D$. Each one is a set of consecutive time steps:
$$d_i = \{ t_i, t_i + 1, \dots, t_i + T_d \}.$$
The log-likelihood of a single day is just
$$\mathcal{L}(d_i) = \sum_{t \in d_i} \log P\left(x(t) \mid x(t-1), \dots, x(t-\Delta)\right).$$
We can now assign a label $L(d_i) = 0$ if $d_i$ is a standard day and $L(d_i) = 1$ if it is a day where a certain external event occurred. We considered two types of external events: 1. Weather conditions: on that day there was fog, a storm, or rain.
2. Strikes: on that day there was a strike in the city capable of affecting car traffic. We considered Taxi strikes, Public Transportation strikes, and general strikes involving all working categories.
Among the days in the test dataset, 50% were labeled as 1. In fig. 5 we show the ROC curve for the classification of such days, using the log-likelihood trained with the parameters {∆ = 48, λ = 0.006}. The Area under the ROC curve (AuROC) is 0.81, indicating that $\mathcal{L}$ is a good indicator of whether an external event occurred on a specific day.
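The scoring step behind the ROC analysis can be sketched as follows, with made-up daily scores in place of the actual log-likelihoods: a day is "positive" when an external event occurred, we score each day by its negated log-likelihood, and the AuROC is the probability that a positive day scores higher than a regular one (ties counted as 1/2):

```python
# Rank-based AuROC: P(score of a positive day > score of a negative day).

def auroc(scores_pos, scores_neg):
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# score = -log-likelihood of the day: perturbed days should be less likely.
neg_ll_event_days   = [5.1, 4.8, 6.0, 3.9]   # strikes / bad weather (label 1)
neg_ll_regular_days = [3.0, 3.5, 4.0, 2.8]   # regular days (label 0)
auc = auroc(neg_ll_event_days, neg_ll_regular_days)
```

This pairwise-comparison formulation is equivalent to integrating the ROC curve, which is why a score that separates the two classes well yields an AuROC close to 1.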

DISCUSSION
In this work, we addressed the problem of building statistical and machine learning models to predict mobility patterns within the Milan Metropolitan Area.We focused on the prediction of how many car-sharing vehicles are parked within a specific area at different times.To do this, we analyzed a longitudinal dataset of parking events coming from the APIs of the Italian car-sharing company "Enjoy".The processed data consisted of a collection of time series representing the number of parked vehicles in a given area at a given time.
To predict the evolution of these time series, we used a Maximum Entropy (ME) approach, which requires the identification of relevant observables from the data and the use of these observables to define probability density functions depending on some parameters.
Maximum Entropy modeling has proven to be very effective at studying and determining the activity patterns inside the city (Fig. 3).
We compared our model with other models specifically built for time series forecasting.In particular, we used a SARIMA model, a Feed-Forward Neural Network, and a Long Short-Term Memory Neural Network.
Maximum Entropy models outperformed SARIMA models with respect to all the metrics used in the evaluation. Our model is as predictive as LSTM Neural Networks, while using two orders of magnitude fewer parameters, offering interpretable parameters, and allowing different studies to be performed with the same model.
Finally, we used the statistical model also to identify extreme events affecting urban traffic, such as bad weather conditions or strikes.
In conclusion, this work is a first attempt at modeling the inherently complex interactions and causalities that shape the fluxes of people and vehicles within urban spaces. We showed that the interplay of the activity of different areas at different times is far from trivial. Indeed, we found the models derived within the ME framework to accurately reproduce the empirical time series using a mixture of correlations between different areas at different times.
Moreover, our finding that linear correlations are sufficient to predict the mobility patterns (matching the performance of neural networks) is an interesting result that calls for further investigation.
In Additional Information we show the prediction ability on three other cities: Rome, Florence, and Turin. Given our results, several research directions can be taken to improve and extend this work:
• A more extensive study of the effects of seasonality could help in building better models. Specific season-dependent models could be built by taking into account larger datasets.
• Evaluate how the prediction ability of the Maximum Entropy models is dependent upon the structure of the city or the resolution of the time series.
• Entangling mobility patterns with other city indicators, such as socio-political disparities and economic indicators, can lead to a better model that depicts human distribution.
• The inclusion of non-linear interactions in the ME models could be hard if made by defining standard observables. Instead, hidden-variable approaches could be taken into account, e.g., Restricted Boltzmann Machines [31].
• The ME models could be adapted to perform other anomaly detection tasks, i.e. identifying parts of the time series which are not standard fluctuations of the system [32,33].

Dataset
The dataset used in this work is a log of the locations of cars belonging to Enjoy, a popular Italian car-sharing service. Although the service is active in six major Italian cities, we focus our analysis on the city of Milan. The data have been collected via the Enjoy APIs, scraping the information through a client account. From the API endpoints we obtained information about the area of service (i.e., the limits within which people can start and end their rides), the location of vehicles, and the points of interest (POIs) related to the service (such as fuel stations and reserved parking spots). The endpoints used share the base URL https://enjoy.eni.com/ajax/ and provide, respectively, the service area, the cars, and the points of interest.
We collected the data recording all the movements of Enjoy cars within the city of Milan in 2017. Through the scraping procedure, we collected a series of events, divided into two categories: parking events and travel events of each Enjoy car we could observe. Each event comes with a set of information, such as the car plate, the time at which the event occurred (with a precision of seconds), and, in the case of parking events, the latitude and longitude of the parking spot. Starting and arrival points are also recorded in the case of travel events. In the rest of this article, we will focus mainly on parking events, but our methods and analyses could also be applied to study the volume of travel events. The final aim of the work was to predict the number of vehicles parked at a given time in each and every district of the city. To this end, we divided the area of service into the different municipalities (we will call them zones). Different tessellation strategies (e.g., using hexagons with a 900 m side) have been considered, but the municipality division provides more statistically stable training data, and thus more precise models.
For each type of event (parked car) that we collected, we defined the activity of a zone as the number of events (i.e., cars parked) occurring there in a certain fixed amount of time δt. Indicating each zone with the index i, we obtained for each of them a discrete time series x_i(t) representing the activity of the zone i in the time frame [t, t + δt]. We will indicate with t = T the last time frame observed in the data. In the following, we will use δt = 1800 s, corresponding to 30 minutes. This time-bin width has been chosen so as to have a stable average activity x_i(t) and a stable variance σ(x_i(t)) over all the zones. Indeed, narrower or larger time bins feature unstable means and/or variances in time, while the 30-minute binning is significantly stable throughout the whole observation period. This characteristic helps the model to generalize and predict better, as the distributions of the modeled quantities do not change too much in time.
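The binning procedure can be sketched as follows (zone names and timestamps are made up for illustration):

```python
from collections import Counter

# Turn a list of (zone, timestamp) parking events into per-zone activity
# counts over 30-minute bins.

DT = 1800  # bin width in seconds (30 minutes)

def zone_activity(events, t0):
    """events: iterable of (zone_id, unix_timestamp). Returns
    {zone_id: Counter(bin_index -> count)} with bin_index = (t - t0) // DT."""
    activity = {}
    for zone, t in events:
        bins = activity.setdefault(zone, Counter())
        bins[(t - t0) // DT] += 1
    return activity

events = [("centro", 100), ("centro", 1700), ("centro", 1900), ("navigli", 200)]
act = zone_activity(events, t0=0)
```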
Some of the models we would like to implement require real variables distributed over $\mathbb{R}$. However, the $x_i(t)$ activity we have defined so far belongs to $\mathbb{N}$ by definition. For this reason, we defined a normalized variable
$$z(t) = \frac{x(t) - \mu}{\sigma}, \qquad (13)$$
where $x$ is the original activity data and $\mu$ and $\sigma$ are the mean and the standard deviation of the activity in time. In this case, for the population we take into account the typical day: the statistics are taken w.r.t. the same time bin for every day. So, Eq. (13) becomes
$$z_i(t) = \frac{x_i(t) - \mu_i(t_{\delta t})}{\sigma_i(t_{\delta t})},$$
where $t_{\delta t}$ denotes the time bin of the day to which $t$ belongs, $\mu_i(t_{\delta t})$ is the average over all the activities of the zone $i$ at times with the same value of $t_{\delta t}$, and $\sigma_i(t_{\delta t})$ is their standard deviation. To keep the notation simple, we will indicate $z_i(t)$ with $x_i(t)$, keeping in mind that now $x$ refers to a normalized activity.
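This typical-day normalization can be sketched as follows (toy two-bin "days" for brevity): each value is compared with the mean and standard deviation of the same time-of-day bin across all observed days.

```python
# Normalize a flat activity series against its "typical day": for each
# time-of-day bin, subtract the mean and divide by the std of that bin
# computed over all days.

def normalize_typical_day(x, bins_per_day):
    n = len(x)
    z = [0.0] * n
    for b in range(bins_per_day):
        vals = [x[t] for t in range(b, n, bins_per_day)]   # same bin, all days
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals)
        sigma = var ** 0.5 if var > 0 else 1.0             # guard sigma = 0
        for t in range(b, n, bins_per_day):
            z[t] = (x[t] - mu) / sigma
    return z

# Two "days" of two bins each: bin 0 alternates 10/14, bin 1 is constant.
z = normalize_typical_day([10, 5, 14, 5], bins_per_day=2)
```

In the example, the alternating bin maps to one standard deviation below/above its typical-day mean, while the constant bin maps to zero.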
From now on, we will work on the normalized x_i(t): these indicate how much a given area is more "active" (i.e., has more cars parked) relative to the average volume of that time bin t (typical-day activity), weighting this fluctuation with the standard deviation observed for that zone-hour volume. In this way, we can compare the signal of areas with high fluctuations and high activities with that of less frequented areas around the city.
The final step when processing data for inference and Machine Learning is to divide the data into a train set and a test set. The models are then trained on the first dataset, and their precision is tested on the second one, so as to check their ability to generalize to unseen data. Different kinds of splitting have been tried, such as random splitting or taking the first part of the time series as training and the last part as test, with similar results.
As a final remark, in the following we will indicate with $t$ the time bin of activity, i.e. $x_i(t)$ will indicate the activity of the zone $i$ in the time range $[t\,\delta t, (t+1)\,\delta t]$.

Pseudo-Log-Likelihood Maximization
The Boltzmann probability is defined over the whole time series of all the zones, i.e.
$$P(X \mid \theta) = \frac{1}{Z(\theta)} \exp\left( \sum_{t} \sum_{i} \left[ -a_i\, x_i(t)^2 + h_i\, x_i(t) + \sum_{\delta} \sum_{j} J^{\delta}_{ij}\, x_i(t)\, x_j(t-\delta) \right] \right). \qquad (15)$$
From this, it is straightforward to define the conditional probability of one time step $x(t)$ given all the previous ones:
$$P\left(x(t) \mid x(t-1), \dots, x(t-\Delta)\right) = \frac{P\left(x(t), x(t-1), \dots, x(t-\Delta)\right)}{P\left(x(t-1), \dots, x(t-\Delta)\right)}. \qquad (16)$$
Using equation (8), we can define the Pseudo-Log-Likelihood as
$$\mathcal{L}_{pseudo}(\theta) = \sum_{t} \log P\left(x(t) \mid x(t-1), \dots, x(t-\Delta)\right). \qquad (17)$$
Here, using Eq. (16) and substituting the functional form of the two total probabilities, we obtain
$$P\left(x(t) \mid x(t-1), \dots, x(t-\Delta)\right) = \prod_{i} \frac{1}{Z_i(t)} \exp\left( -a_i\, x_i(t)^2 + v_i(t)\, x_i(t) \right) \qquad (18)$$
with
$$v_i(t) = h_i + \sum_{\delta} \sum_{j} J^{\delta}_{ij}\, x_j(t-\delta)$$
and $Z_i(t)$ the corresponding one-dimensional normalization. Substituting in eq. (17), we get
$$\mathcal{L}_{pseudo}(\theta) = \sum_{t} \sum_{i} \left[ -a_i\, x_i(t)^2 + v_i(t)\, x_i(t) - \log Z_i(t) \right],$$
and we can calculate the gradients of $\mathcal{L}_{pseudo}$ w.r.t. the parameters:
$$\frac{\partial \mathcal{L}_{pseudo}}{\partial h_i} = \sum_{t} \left[ x_i(t) - \langle x_i(t) \rangle \right], \quad \frac{\partial \mathcal{L}_{pseudo}}{\partial J^{\delta}_{ij}} = \sum_{t} \left[ x_i(t) - \langle x_i(t) \rangle \right] x_j(t-\delta), \quad \frac{\partial \mathcal{L}_{pseudo}}{\partial a_i} = \sum_{t} \left[ \langle x_i(t)^2 \rangle - x_i(t)^2 \right],$$
where $\langle \cdot \rangle$ denotes the average over the conditional distribution in Eq. (18). The fact that the gradients and the cost function can be computed exactly makes the inference of the parameters relatively easy with respect to other cases where they need to be approximated.
Once the parameters of the model have been inferred, it is possible to use it to predict the temporal evolution of the normalized activities of the system. Given the state of the system up to time $t-1$, i.e. $(x(t-1), \dots, x(t-\Delta))$ (past time steps further than $\Delta$ from $t$ are not relevant), we can use equation (18) to predict the next step $x(t)$. Since the probability in (18) is completely defined by $(x(t-1), \dots, x(t-\Delta))$, the best prediction of $x_i(t)$ is the average of the distribution,
$$\hat{x}_i(t) = \frac{1}{2 a_i}\, v_i(t) + \frac{1}{Z_i(t)}.$$
In other words, we are using the generative model to perform a discriminative task. This makes it possible to compare this model with standard machine learning ones, by comparing their precision in the prediction of the time series. To avoid over-fitting, we used L1 regularization [34,35]; an in-depth description of the technique can be found in the references. In practice, the cost function to be optimized is
$$\mathcal{L}_{reg}(\theta) = \mathcal{L}_{pseudo}(\theta) - \lambda \sum_{i,j,\delta} \left| J^{\delta}_{ij} \right|.$$
The first term of this sum is the log-likelihood; the second term is the regularization term.
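The one-step-ahead prediction can be sketched as follows, keeping only the leading Gaussian-mean term $v_i(t)/(2a_i)$ of the predictor and using made-up parameters (two zones, memory $\Delta = 1$):

```python
# One-step-ahead forecast from the inferred couplings (toy parameters):
# v_i(t) = h_i + sum_{delta,j} J[delta][i][j] * x_j(t - delta), and we
# predict with the leading term v_i(t) / (2 * a_i).

def predict_next(past, h, a, J):
    """past[d-1][j] = x_j(t-d); J[d-1][i][j] couples x_i(t) to x_j(t-d)."""
    n = len(h)
    x_hat = [0.0] * n
    for i in range(n):
        v = h[i]
        for d, x_lag in enumerate(past):
            v += sum(J[d][i][j] * x_lag[j] for j in range(n))
        x_hat[i] = v / (2.0 * a[i])      # Gaussian-mean term of the predictor
    return x_hat

h = [0.2, 0.0]                            # made-up fields
a = [0.5, 0.5]                            # made-up variance parameters
J = [[[0.1, 0.3],
      [0.0, 0.2]]]                        # made-up lag-1 couplings
x_hat = predict_next(past=[[1.0, 2.0]], h=h, a=a, J=J)
```

Iterating this map, feeding each prediction back as the newest past step, projects the activities forward in time, which is how the model's forecasts are compared with the SARIMA and neural-network baselines.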
Taking the gradient of the regularized cost function, we obtain
$$\frac{\partial \mathcal{L}_{reg}}{\partial J^{\delta}_{ij}} = \frac{\partial \mathcal{L}_{pseudo}}{\partial J^{\delta}_{ij}} - \lambda\, \mathrm{sign}\left(J^{\delta}_{ij}\right),$$
where sign is the sign function.
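A single regularized update step can be sketched as follows (toy values for the learning rate and λ): the L1 penalty adds a constant pull of magnitude λ toward zero on each coupling.

```python
# One L1-regularized ascent step on a single coupling theta:
# theta <- theta + eta * (grad - lam * sign(theta)).

def sign(x):
    return (x > 0) - (x < 0)

def l1_step(theta, grad, eta=0.1, lam=0.005):
    return theta + eta * (grad - lam * sign(theta))

# With a vanishing likelihood gradient, the penalty shrinks the coupling
# toward zero, which is what sparsifies the inferred J matrix.
theta = 1.0
for _ in range(10):
    theta = l1_step(theta, grad=0.0)
```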
The training curves show no sign of overfitting, as the log-likelihood asymptotically stabilizes for the validation and train sets (see Fig. 6 in the Appendix).

FIG. 2 :
FIG. 2: Model comparison using different metrics.We see similar results between the MaxEnt model and the Deep Learning ones.

TABLE I :
$R^2$ values w.r.t. ∆ and λ, resulting from 100000 steps of training.

TABLE II :
Grid search over the SARIMA parameters.

FIG. 6: Log-Likelihood training for different values of ∆.