On learning agent-based models from data

Agent-Based Models (ABMs) are used in several fields to study the evolution of complex systems from micro-level assumptions. However, a significant drawback of ABMs is their inability to estimate agent-specific (or “micro”) variables, which hinders their ability to make accurate predictions using micro-level data. In this paper, we propose a protocol to learn the latent micro-variables of an ABM from data. We begin by translating an ABM into a probabilistic model characterized by a computationally tractable likelihood. Next, we use a gradient-based expectation maximization algorithm to maximize the likelihood of the latent variables. We showcase the efficacy of our protocol on an ABM of the housing market, where agents with different incomes bid higher prices to live in high-income neighborhoods. Our protocol produces accurate estimates of the latent variables while preserving the general behavior of the ABM. Moreover, our estimates substantially improve the out-of-sample forecasting capabilities of the ABM compared to simpler heuristics. Our protocol encourages modelers to articulate assumptions, consider the inferential process, and spot potential identification problems, thus making it a useful alternative to black-box data assimilation methods.

Agent-Based Models (ABMs) are computational models in which autonomous "agents" interact with one another and with their environment, thereby producing aggregate emergent phenomena [1]. ABMs are an extremely successful tool for theory development, that is, to explore the macro-level implications of micro-level assumptions [2]. As Axelrod [3] put it, "whereas the purpose of induction is to find patterns in data [...], the purpose of agent-based modeling is to aid intuition". In line with this focus on theory development, the ability of ABMs to match empirical data and make quantitative forecasts, that is, to learn from data, has so far been limited [4, 5, 6, 7].
At a very high level, all ABMs can be described by the formula Z t ∼ P t (Z 0 , . . ., Z t−1 ; Θ), where Z t are the variables of interest in the system, Θ is a set of parameters, P t is a probability function implicitly defined by the model specifications, and t is the discrete time index. Typically, Θ has relatively few components and a fixed dimensionality, is interpretable by a domain scientist, and is the only tuning knob of the model. Z is at the same time the state and the output of the model. Each component of Z typically refers to an individual agent, which results in high dimensionality. Some of Z is observable, while the rest is latent.
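To make this abstract formula concrete, here is a minimal Python sketch of the generic simulation loop it describes; the names (`simulate_abm`, `step_fn`) and the toy wealth-growth rule are our own illustration, not part of any specific ABM.

```python
import random

# Minimal sketch of the generic simulation loop behind the formula
# above; simulate_abm, step_fn, and the toy wealth rule are our own
# illustration, not part of any specific ABM.
def simulate_abm(step_fn, z0, theta, T, seed=0):
    # step_fn samples Z_t given Z_{t-1}: it plays the role of the
    # implicitly defined probability function P_t.
    rng = random.Random(seed)
    trajectory = [z0]
    for _ in range(T):
        trajectory.append(step_fn(trajectory[-1], theta, rng))
    return trajectory

# Toy micro-rule: each agent's wealth grows multiplicatively with noise.
def step(z, theta, rng):
    return [w * theta["growth"] + rng.gauss(0, theta["noise"]) for w in z]

traj = simulate_abm(step, z0=[1.0, 2.0, 3.0],
                    theta={"growth": 1.01, "noise": 0.1}, T=20)
```

Note how the state Z (here, a list with one entry per agent) is both high-dimensional and the model's only output, while Θ is a small dictionary of interpretable knobs.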
Most of the efforts in learning agent-based models from data have focused on parameter calibration. This task refers to the process of finding a parametrization Θ that can reproduce some macroscopic characteristic of the data, and it typically boils down to comparing a few summary statistics of aggregate empirical and model-generated variables (e.g., time series) [6, 8]. Summary statistics are valuable to focus on the most important characteristics of the data that the modeler wants to explain, but they often have to be chosen arbitrarily, and may hide very different underlying patterns (as in the well-known Anscombe's quartet).

Figure 1: Our approach compared to a standard approach towards calibrating an ABM of the housing market.
(A) Focusing on the boroughs in the center of London (bottom layer), we consider the yearly average of transaction prices (middle layer) as an example of an observed variable, and the distribution of agent incomes (top layer) as an example of a latent variable. (B) For each borough, we observe a time series of transaction prices. In the standard approach to calibration, modelers could typically calibrate some parameters Θ (such as the probability for inhabitants to put their home on sale) by computing the moments of transaction prices across boroughs and years (as represented through a box plot), and minimizing the distance from the same moments in model-generated time series. In our approach, instead, we are able to calibrate the evolution of latent variables Z (in this example, borough-level agent incomes) by exploiting all information contained in the time series, rather than reducing this information to specific summary statistics. (C) In the model, prices depend on agent incomes. Thus, since in the standard approach agent incomes are not calibrated, model-generated time series are bound to diverge, even if prices are initialized as in the data. With our approach, as we repeatedly estimate incomes, we can make model-generated time series track empirical ones. This makes it feasible to forecast future prices.
Estimating agent-specific (or "micro") variables Z is instead not usually considered. We think this is the main obstacle to bringing ABMs closer to data and potentially using them as a forecasting tool. Indeed, if modelers do not correctly initialize latent micro variables and update their values as the simulation progresses, and if the dynamics of observed variables crucially depend on the initialization of latent variables, then model-generated time series are bound to diverge from empirical ones. Additionally, this mismatch implies that model-generated and empirical time series cannot be directly compared to produce a "goodness-of-fit" measure, so one must resort to summary statistics or "stylized facts" to calibrate parameters. The ABM community has recently started to explore data assimilation methods to estimate the latent variables of ABMs [9, 4, 10, 11]; we explain the relation of this literature to our work in the Discussion. This paper proposes a general methodology for estimating latent variables in ABMs. Our approach proceeds in three steps:
1. Given an ABM of reference, translate it into a probabilistic model, by simplifying it until a computationally tractable likelihood of the data given the latent variables can be specified.
2. Estimate latent variables at each time step while keeping all past input values fixed (as in online learning). Solidly rooted in probability theory, our approach maximizes the likelihood at each step via expectation-maximization [12] and gradient descent.
3. Repeat this process over multiple epochs, so that temporal dependencies can be appropriately taken into account.
We showcase our approach by applying it to a housing market ABM specifically designed to study income segregation [13] (Figure 1). We use the ABM to generate synthetic data traces that we can use as ground truth (we would not have access to a ground truth for latent variables if we used real-world data traces). We distinguish between observable and latent variables based on how easy it usually is to access the relevant real-world data. The main latent variable in this model is the distribution of agents' incomes in each neighborhood, which is often not available. We instead assume that we observe neighborhood-level mean prices and the number of transactions over time (these data are usually readily available, see, e.g., [14]). We believe this distinction to reflect a common real-world setting in economic models: one might have access to which actions are being performed, but not to the latent state of agents (e.g., where they live, and what their income is). We write the likelihood of prices and of the number of transactions as a function of the household income distribution. Next, we maximize this likelihood, thus estimating the time evolution of the spatial income distribution.
In synthetic experiments, we show that our procedure enables learning latent variables accurately: the Pearson correlation between the ground truth and learned traces ranges between 0.5 and 0.9, depending on the latent variable considered. At the same time, we show that an accurate estimation of latent variables empowers out-of-sample forecasting. Compared to other benchmarks that use rules of thumb to initialize the model at the beginning of the forecasting window, our procedure obtains a lower Root Mean Squared Error (RMSE) with respect to the ground truth, while being more principled. It also highlights potential identification problems, i.e., situations wherein multiple configurations of micro-variables correspond to the global maximum of the likelihood, so that the ground truth configuration cannot be identified.

Model
We start from the housing market ABM presented by Pangallo et al. [13], henceforth, the "original ABM". Our goal is to modify the original ABM until it is possible to write a computationally tractable likelihood of observed variables given latent variables and parameters. If we are able to do so, we say that the modified model is a learnable ABM. While writing a tractable likelihood function, we need to preserve the general behavior of the model, as well as its essential causal mechanisms.
In this section, we first give an overview of the original ABM; then, we describe the learnable ABM resulting from our 'translation' process. Along the way, we highlight the specific transformations needed to make the original ABM learnable. A more detailed explanation of the equations describing our learnable model is given in Materials and Methods and summarized in Table 1. Figure 5 reports the causal links between variables as a graphical model. Supplementary Section S1 provides more details on the models and the translation process.
Original ABM [13]. The ABM describes the housing market of a city composed of L locations or neighborhoods, each with a number of indistinguishable homes, inhabited by agents. Each agent belongs to an income class k, out of K income classes, each characterized by an income Y k . At each time step, individual agents, represented as discrete units, choose a neighborhood in which to purchase a home if they act as buyers, or put their home on sale if they act as sellers. One fundamental insight encapsulated in this model is the formalization of the attractiveness of each neighborhood, which regulates how likely an agent is to bid for that location. The model assumes that the higher the income of residents, the more attractive a neighborhood is. In this original model, matching between individual buyers searching in a neighborhood and sellers in the same neighborhood is modeled as a continuous double auction. This process selects buyers and sellers sequentially at random, puts buyers in a queue ordered from highest to lowest bid price (and sellers from lowest to highest ask price), and, whenever a seller asks a price below the maximum bid price in the queue, matches the buyer with the highest bid price to the seller with the lowest ask price. The social composition of the city evolves as a byproduct of these transactions, as high-income buyers may replace low-income sellers and lead to the gentrification of some neighborhoods. We report the pseudocode of this original model in Algorithm S1.
Learnable ABM. In order to translate such an ABM into a learnable model, we first rewrite it in terms of 'counts', i.e., instead of having variables for each individual agent, with a small loss of generality we consider the aggregated information about the number of identical agents of each income class in each location. This way, we obtain a model that revolves around the state variable M t : at each time step, M t is a matrix of L × K entries, where M t,x,k represents the number of agents of income class k in location x at time t. Similarly, the number of agents of class k buying a house in location x is represented by D B t,x,k , giving a total of D t,x = Σ k D B t,x,k transactions. D t,x is in turn determined as the short side of the market, i.e., the minimum between the number of sellers and the number of buyers in each case. While these two numbers in the original model were stochastic, in our learnable model we use a mean-field approximation, and replace the stochastic realizations with their expected values. The final key variable is P t,x , which represents the average price of the transactions that occur in location x at time t.
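As an illustration of the short-side-of-the-market rule under the mean-field approximation, the following sketch computes D t,x from expected buyer and seller counts; the array names and the toy numbers are ours.

```python
import numpy as np

# Mean-field computation of the number of transactions per location as
# the short side of the market; array names and numbers are ours.
def transactions(expected_buyers, expected_sellers):
    # expected_buyers: (L, K) expected demand per location and class
    # expected_sellers: (L,) expected supply per location
    demand = expected_buyers.sum(axis=1)         # total demand per location
    return np.minimum(demand, expected_sellers)  # D_{t,x}: the short side

B = np.array([[2.0, 1.0, 0.5],                   # L=2 locations, K=3 classes
              [4.0, 0.5, 0.5]])
S = np.array([5.0, 3.0])
D = transactions(B, S)  # first location is demand-limited, second supply-limited
```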
Matching protocol. The matching protocol between buyers and sellers clearly exemplifies the type of transformations needed for our purpose. The continuous double auction of the original ABM is indeed hard to translate into a computationally tractable likelihood. First, we assume that we do not have detailed information on buyers and sellers for individual transactions, so estimating, e.g., the stochastic sequence in which buyers and sellers enter the queue is not feasible. Second, picking the buyer with the highest proposed price is equivalent to an argmax operation. Such an operation is not differentiable, thus causing the whole likelihood to be non-differentiable. Indeed, estimating its outcome would require enumerating all possible cases. To solve both issues, while preserving the properties of the model, we replace the continuous double auction with a multinomial distribution that gives a higher probability of matching to buyers proposing higher prices (Equation M8 in Table 1). This rule is differentiable and can be estimated from observed prices: higher prices indicate that richer agents have settled in the neighborhood.
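The idea of replacing a hard argmax with a smooth matching rule can be sketched as follows; we use a temperature-controlled softmax purely as an illustration, and the paper's actual multinomial rule (Equation M8 in Table 1) may take a different functional form.

```python
import numpy as np

# Differentiable matching sketch: instead of the argmax of the double
# auction, buyers are matched with probability increasing in their bid.
# The softmax here is our illustration, not the paper's Equation M8.
def match_probabilities(bids, temperature=1.0):
    z = np.asarray(bids, dtype=float) / temperature
    e = np.exp(z - z.max())       # numerically stabilized softmax
    return e / e.sum()

bids = [100.0, 105.0, 110.0]
p = match_probabilities(bids, temperature=5.0)
# Higher bids get higher matching probability, but every buyer keeps a
# non-zero probability, so the likelihood remains smooth in the bids.
```

The temperature controls how close the rule is to the original argmax: as it goes to zero, all probability mass concentrates on the highest bid.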

Algorithms
Once we have translated the original ABM into its learnable counterpart, we design an algorithm that infers latent variables by maximizing their likelihood with respect to the observed data and the model's assumptions.
To start with, we need to determine which variables are observed and which are latent. To do so, we think of aggregate information about transactions as the only observable at our disposal. In particular, we assume to know, for each neighborhood and over time, the number of transactions D t and the average price P t . Our key latent variable is instead M t , the distribution of agents of each income class across neighborhoods. As a matter of fact, in many countries it is relatively easy to obtain spatially granular data on transactions, but it is much harder to obtain such data on incomes [14].
Note that M t can be computed deterministically given M t−1 and D B t−1 . Therefore, our problem reduces to finding an estimate for the latent stochastic variable D B t , over all time steps t = 1, . . ., T , and for the starting condition M 0 , given P t and D t : all the other variables are in fact deterministic, and their values are fixed given these. This scenario corresponds to the graphical model shown in Figure 5.
However, the number of possible states of D B grows exponentially with the total number of time steps T : evaluating all possible paths of agents over all time steps would be infeasible even for small values of T . Therefore, we approach our problem as an online task [15], a common technique in machine learning in cases where processing the entire data set at once is infeasible. We process the data per time step: at each t, the algorithm is presented with the newly observed values D t and P t , and it updates its estimate of the latent variables M 0 and D B t , while considering all the values previously estimated as fixed. After the given time step t has been processed, the algorithm is applied on t + 1, and so on until the last time step T . This process, examining each time step from t = 0 to T , is iterated for a number of epochs: after the last time step T has been processed, the algorithm re-starts from the beginning, so that the first time steps are re-evaluated in light of successive ones.
To solve this optimization problem, we propose an expectation-maximization algorithm. Such an algorithm obtains a maximum-likelihood estimate of the latent variables by optimizing the complete-data likelihood of the model. We outline its derivation in Materials & Methods Section B. It operates by repeating the following process at each given time step t. First, it evaluates the likelihood of each possible behavior of the agents, i.e., the possible outcomes of D B t ; then, it uses back-propagation and online gradient descent to find the best M 0 under this likelihood. These two steps are alternated until convergence. This way, at each time step it recovers the most likely value for D B t , and it updates its estimate of M 0 . All the other variables of the ABM are obtained deterministically from these.
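A toy analogue of this estimation scheme, with all names and modeling choices ours: a scalar latent initial condition evolves deterministically, and we recover it from noisy observations by online gradient descent on the Gaussian log-likelihood, sweeping over the time steps for several epochs.

```python
import random

# Toy analogue of the estimation problem (all names are ours): a scalar
# latent initial condition m0 evolves deterministically, m_t = a**t * m0,
# and we observe p_t = m_t + noise. We recover m0 by gradient descent on
# the Gaussian negative log-likelihood, processing one time step at a
# time over several epochs, mirroring the online scheme described above.
def neg_loglik_grad(m0, obs_t, t, a, sigma=1.0):
    m_t = m0 * a ** t                      # deterministic forward pass
    resid = m_t - obs_t                    # d(-loglik)/d m_t
    return (resid / sigma ** 2) * a ** t   # chain rule back to m0

rng = random.Random(0)
a, true_m0 = 1.05, 10.0
obs = [true_m0 * a ** t + rng.gauss(0, 0.05) for t in range(20)]

m0, lr = 1.0, 1e-3                         # deliberately bad initial guess
for epoch in range(200):
    for t, p in enumerate(obs):            # one online step per observation
        m0 -= lr * neg_loglik_grad(m0, p, t, a)
```

The actual algorithm additionally alternates this gradient step with an E-step over the discrete outcomes of D B t , which the toy omits.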
In order for the algorithm described thus far to be scalable, we need to solve one last computational challenge: even in a single time step t, the space of possible outcomes of each D B t is huge, since in principle one should consider the decisions of all individual buyers as independent. We solve this problem by considering that, while in agent-based modeling it is common to model the behavior of individual agents, for our purposes it is sufficient to evaluate the chances of groups of identical agents moving to one location or another: the behavior of a single agent is irrelevant with respect to the data we observe. Therefore, instead of considering all the possible outcomes of each D B t , we consider only those set apart by at least s agents, where s is a learning hyper-parameter.
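The effect of the coarse-graining hyper-parameter s can be illustrated by enumerating candidate allocations of n identical buyers across L locations on a grid of step s; the helper below is our own illustration.

```python
from itertools import product

# Candidate allocations of n identical buyers across L locations,
# enumerated on a grid of step s instead of agent by agent (s is the
# learning hyper-parameter from the text; the function name is ours).
def candidate_allocations(n, L, s):
    grid = range(0, n + 1, s)
    return [c for c in product(grid, repeat=L) if sum(c) == n]

fine = candidate_allocations(12, 3, 1)    # all 91 individual outcomes
coarse = candidate_allocations(12, 3, 4)  # only 10 outcomes at step s=4
```

Even in this tiny example the candidate space shrinks by an order of magnitude, and the saving compounds across income classes and time steps.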

Experiments
In order to evaluate the efficacy of our approach, we perform two sets of experiments. First, we assess its fidelity, i.e., how well our method recovers latent variables. To do so, we generate a synthetic dataset from the original ABM as ground truth, and feed the observable part of such data to our likelihood-maximization algorithm. Second, we show that learning latent variables allows us to produce more accurate out-of-sample forecasts compared to existing heuristics.

Recovering latent variables
We consider the time series of the price P t,x and of the number of transactions D t,x at each location x to be observable. We also assume that the macro-parameters that generated the data set (e.g., the total number of agents per location N , or the global income distribution) are known. Two other time series, namely those of inhabitants M t,x,k and buyers D B t,x,k , for all locations x and incomes k, are considered latent: they are hidden from the algorithm and used as a validation for what the algorithm learns.
We use the original ABM to generate 20 data traces with L = 5 locations, K = 3 income classes, and T = 20 time steps that we use as ground truth. Each data trace differs from the others in the random initialization of M 0 . We use the first 10 traces as a training set to tune the hyperparameters of the algorithm (see Supplementary Figure S6). Then, on the remaining 10 traces that we hold out as a test set, we evaluate the performance of the algorithm by computing the Pearson correlation coefficients between learned time series and ground truth ones. Note that our learning algorithm uses the learnable ABM to specify the likelihood, so there is some misspecification compared to the original ABM used to generate the ground truth. For completeness, we also repeat the same evaluation by using the learnable ABM to generate ground-truth data traces, thereby removing misspecification. We view this latter test as a sanity check for the algorithm. Figure 2 shows the results for the test sets. As expected, our ability to reconstruct latent variables is higher for traces generated with the learnable ABM, as there is no misspecification. Perhaps more interestingly, our algorithm reconstructs the time series of buyers D B t,x,k with higher fidelity than the time series of inhabitants M t,x,k (mean correlation ρ = 0.86 vs. ρ = 0.52 with traces from the original ABM). Even though M t,x,k proves to be harder to reconstruct, we still obtain an informative estimate, which further improves when misspecification is removed (ρ = 0.79). We conjecture that this difference in results may be due to the fact that D B t,x,k is a "flow" variable that does not depend explicitly on previous time steps, while M t,x,k is a "stock" variable that depends on the whole history, so errors in estimating M t=0 accumulate at subsequent time steps. These results also hint at an identification problem in the original ABM, which we elaborate on further at the end of this section. Regarding the observable variables, our algorithm fits the prices P t,x almost perfectly (ρ = 0.99), and the number of transactions D t,x very well (ρ = 0.85). Without misspecification, the fit for the number of transactions is perfect (ρ = 1.0). While such a good fit for observable variables is expected, since our inference method works by minimizing the distance from observable variables, this result indicates that no major misspecification is introduced by using the learnable ABM to infer latent variables from traces of the original ABM. In other words, our translation does not alter the nature of the original model.
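For reference, the fidelity metric used above is the plain Pearson correlation between a learned trace and the ground-truth one, computed over the flattened entries; a minimal implementation (ours):

```python
import math

# Plain Pearson correlation between two flattened traces; this is the
# fidelity metric reported above (implementation ours).
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

rho = pearson([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.9])  # close to 1
```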
Figure 3 zooms in on a representative trace generated by the original ABM. For fairness, we choose this experiment as the median one in terms of performance (i.e., correlation between the ground truth and estimated values of M ). These time series confirm the intuition from Figure 2: our approach is able to reconstruct P t and D t extremely well, and it is also quite precise at reconstructing D B t . Our estimate of M t is also very accurate in most cases, but imprecise estimates of the initial conditions M 0 lead in a few cases (for instance, in location x = 3) to an inaccurate reconstruction. In these cases, the algorithm in fact finds a local minimum that does not correspond to the ground truth.
One of the possible reasons behind this behavior is the presence of an identification problem. We show, in fact, that in some cases the likelihood of the observed data is the same for different possible values of the latent variable M 0 . While these possible values include the ground truth (or, in case of misspecification, values extremely close to it), the model does not have enough information to distinguish it from the other possible optimal values of M 0 . This phenomenon is intrinsic to the ABM under study, once we identify P and D as observable and M 0 as latent. We provide a concrete example in Supplementary Section S2.3. Figure S10 shows a representation of the likelihood that efficiently visualizes such issues. Our approach, in fact, makes it possible to formally define, and thus diagnose, such issues. Of course, one could also do this by sampling from the parameter space and computing summary statistics, as with Approximate Bayesian Computation (ABC) calibration methods (see, e.g., Ref. [16]). Our approach, which features a closed form of the likelihood, has three advantages over these methods: (i) higher accuracy, as we do not have sampling error; (ii) higher efficiency, as we do not need to repeatedly execute the model; and (iii) the possibility to look for local minima using gradient-based methods.

Out-of-sample forecasting
Except for a few recent attempts [17, 5, 18], so far the use of ABMs for forecasting has been limited. A key problem is that ABM state variables are mostly latent, as it is often hard to observe information that describes individual agents. To the extent that the aggregate dynamics depend on the agent states, a wrong initialization of the latent state variables is likely to lead to a very inaccurate forecast. In this section, we explicitly test whether this is true for our model by using synthetic experiments. To shift our focus away from misspecification errors, we use the learnable ABM to generate the ground truth. We extend each of the 10 test traces for 5 additional time steps, so that the total length of each simulation becomes T = 25. To perform the forecast, we initialize the learnable ABM at t = 20 with a given estimate of the latent state variables M t=20 , and let it produce the time series P t and D t for the out-of-sample time steps t ∈ [21, 25]. We compare five approaches for the estimate of the latent variable M T :
1. Ground truth: we use the true value of M T generated by the ABM, which we assume to be unobservable. Because of inherent stochasticity, the forecast error is not zero. However, this method represents a lower bound on the forecast error.
2. Random: we draw M T from a Dirichlet distribution whose parameters are consistent with the share of buyers Γ k . A random initialization of latent variables is very common in ABMs; for instance, in epidemiological ABMs it is common to choose infected seeds at random [19].
3. Proportional: we draw M T in such a way that locations with prices higher than the mean price over the city have a higher share of high-income inhabitants. The strength of this relation is governed by a hyperparameter that we calibrate in-sample on the same 10 traces that we use to select the hyperparameters of the learning algorithm.
4. Time series: we run 1000 simulations starting from different values of M 0 and select the M T corresponding to the simulation with the lowest RMSE with respect to the observable time series P t and D t in sample, i.e., for t ∈ [1, 20]. This is, for instance, the method used by Geanakoplos et al. [20].
5. Learnable: we infer M T with our algorithm, by using the estimates obtained as specified in the previous section.
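The "Time series" benchmark above can be sketched as follows; the toy simulator and all names are ours, standing in for the ABM runs described in the list.

```python
import math
import random

# Sketch of the "Time series" benchmark (all names ours): run many
# simulations from random initial conditions and keep the final state of
# the run whose in-sample trace is closest (in RMSE) to the observations.
def select_by_rmse(simulate, observed, n_sims=1000, seed=0):
    rng = random.Random(seed)
    best_state, best_rmse = None, float("inf")
    for _ in range(n_sims):
        trace, final_state = simulate(rng)
        rmse = math.sqrt(
            sum((a - b) ** 2 for a, b in zip(trace, observed)) / len(observed)
        )
        if rmse < best_rmse:
            best_state, best_rmse = final_state, rmse
    return best_state

# Toy stand-in for the ABM: the latent state is a scalar drift, and the
# observable trace is its cumulative path over 20 in-sample steps.
def sim(rng):
    drift = rng.uniform(-1, 1)
    return [drift * t for t in range(20)], drift

best = select_by_rmse(sim, observed=[0.5 * t for t in range(20)])
```

Unlike the likelihood-based approach, this heuristic only ranks whole simulations and never updates the latent state within a run.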
To evaluate the quality of the forecasts obtained by these approaches, we compute the Root Mean Squared Error (RMSE) for the observable time series P t and D t , summing the errors from time step t = 21 up to t = T = 25. Figure 4 shows the results. We do not report the values for the Random approach, as its RMSE of 12 is well above that of the other approaches. Our Learnable approach substantially improves over the Proportional and Time series approaches, and is essentially on par with the Ground truth benchmark. This is a strong improvement, but we believe that the value of our approach goes beyond this aspect. In fact, the alternative approaches are heuristics that do not yield much insight about inference. By contrast, our approach is more principled: it frames the problem of estimating unobservable variables of an ABM as probabilistic inference. This methodology opens new research directions to further improve our results, for instance, designing learnable ABMs from the start, for which there would be no misspecification error. Even more importantly, it makes it possible to formally reason about the likelihood of an ABM, for instance, to spot potential identification problems.

Figure 4: Forecasting error for our method compared to alternative benchmarks. Forecasting error is measured as the sum of the RMSE on the P t and D t time series. We consider the same 10 traces as in the experiments above, and show results for each trace as a dot, with a box-and-whisker plot as a summary. Whiskers extend from the minimum to the maximum value, while boxes range from the 25th to the 75th percentile.

Discussion
Now that we have shown the advantages of our method in terms of predictive capabilities, we discuss its general applicability. From the specific translation of the housing market ABM considered in this paper, we can identify some general design principles that we believe will be useful in making other ABMs learnable.
First of all, it was necessary to tune the level of stochasticity of the model by considering which variables are observed and which are latent. In most graphical models, latent variables are stochastic random variables that are related to the observable ones; indeed, if they were deterministic, they could be computed exactly [21]. All stochastic variables that are not observed must be estimated, thus increasing both the computational complexity of the process and the uncertainty of the model. However, in our translation, we have room to decide which variables are deterministic and which are stochastic. To make the model truly learnable, we need to balance observable and latent variables so that for every latent variable we have some observable that intuitively makes it possible to estimate it. We can encapsulate this first design principle as follows.
Principle 1. Stochasticity parsimony. In a learnable ABM, the amount of stochasticity should be commensurate with data availability.
Second, we needed to carefully consider which variables and functional forms should be discrete. ABMs often consist of discrete units, and it is common for agents to choose between different discrete possibilities. However, discreteness makes likelihood optimization problematic. Indeed, whenever we deal with discrete variables, the likelihood must consider all possible combinations of values for the discrete stochastic variables, which greatly increases the computational burden of the approach. Moreover, in some cases the likelihood may be flat over some region of the latent space, thus hindering the progress of optimization algorithms such as gradient descent. Given these considerations, it is important to limit the use of discrete variables to the ones that are critical to the behavior of the model.

Principle 2. Differentiability preference. A learnable ABM should prefer continuously differentiable functions over discrete choices when they do not alter the behavior of the model.
Following these principles, it should be possible to transform any ABM into a learnable one, given enough data. While the translation in this paper was still hand-curated, it is a first step towards a proper formalization, and thus automation, of this process.
Nevertheless, several alternative methods have been suggested in the literature to obtain similar results. Making the state of the system compatible with real-world observations has traditionally been the goal of data assimilation techniques, such as the various versions of the Kalman filter or the particle filter. Originally developed in meteorology and the geosciences [22], data assimilation techniques have recently been employed in ABM research [9, 4, 10, 11]. These works treat the ABM as a black box, adjusting ABM state variables so that forecasts come closer to observations. The main advantage of data assimilation techniques over our approach is that they do not require building a new model (the learnable ABM). At the same time, our approach offers several advantages. (i) It deals with the estimation of discrete variables in a natural and principled way. Standard data assimilation methods only allow tuning continuous variables [9, 4], and recent attempts to deal with discrete variables [10, 11] tend to be heuristic and problem-specific. (ii) Its closed-form likelihood can be maximized with computationally efficient gradient-based methods, by leveraging deep learning frameworks and architectures. (iii) Such a closed-form likelihood is also an essential tool to analyze identification problems, thus offering explanations about the estimated variables. (iv) While Kalman filters require Gaussian or quasi-Gaussian noise, and linear or weakly non-linear functional forms, our approach can easily integrate most types of stochastic elements and non-linearities. Considering these advantages, we believe that likelihood-based estimation of ABM micro-states is a promising direction to obtain more principled approaches to data-driven ABMs.

Conclusions
In this work, we have shown how to translate a complex agent-based model into a probabilistic graphical model to obtain a learnable ABM. For this type of model, methods such as maximum likelihood estimation can be used to estimate the latent micro-state variables of the agents coherently with both the model and the provided data. We then proposed an expectation-maximization algorithm for the resulting learnable ABM in order to estimate the latent variables given the observed ones. We have shown that this process is indeed able to recover unobserved variables that are in line with the learnable model, as well as with the original one, under a variety of settings. This way, we can feed such learned variables to the ABM, and obtain an evolution of its micro-states that is in line with the provided data. This procedure enables the ABM to be used as a forecasting tool.
Building a fine-grained link between an ABM and observed data opens the way for several exciting opportunities. As we have shown in this work, it first allows for better usage of ABMs as instruments for prediction. Initializing agents' micro-states in a way that is coherent with observed data means that their future trajectory can be regarded as the best compromise between the theoretical model assumptions and the available observations. Therefore, the quality of the predictions can also be used as a direct validation (or falsification) of the causal model embodied in the ABM. Beyond these immediate advantages, more advanced possibilities open up. For instance, defining the likelihood of the model with respect to the observations makes it possible to perform model selection using available data. In other words, it allows using ABMs to formulate hypotheses and test them against real data. While this technique has been shown to work properly in simple cases [15], its application to more complex ABMs requires further analysis and paves the way for novel research directions. Furthermore, the translation of an ABM into a probabilistic model forces the modeler to lay bare their assumptions and to consider the inferential problem. In this way, it brings forth possible identification problems: when different models (or realizations of the same model) lead to the same observable state, how can we choose one in practice? Such a problem, often ignored in ABM research, will be vital to consider in further applications of ABMs to real-world data.
Our approach for learnable ABMs stems from the general framework of probabilistic graphical models [23]. While their application to ABMs opens possibilities for interdisciplinary cross-pollination, it also poses new theoretical challenges. Because of the complexity of ABMs, many commonly applied methods, such as Markov Chain Monte Carlo (MCMC) [23], become computationally unfeasible. Because ABMs aim to model emergent behavior through the combination of many simple rules, they often involve long chains of dependencies among variables, often with highly non-linear behavior. Hence, many desirable theoretical properties necessary for the convergence of MCMC might be missing, such as the uniqueness of the posterior distribution [24]. More generally, such a distribution is often very complex, high-dimensional, and difficult to learn through sampling techniques. Thus, we choose to maximize the likelihood by leveraging gradient descent and automatic differentiation [25]. Interestingly, because of the long sequences of deterministic transformations typically found in ABMs, our optimization task ends up resembling deep learning ones. However, while the transformations in deep learning are purely data-driven (transformations aim only at maximizing prediction accuracy), our methodology still places an emphasis on causal mechanisms: each transformation represents an aspect of the theory being modeled.
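This resemblance can be made concrete with a toy sketch: a scalar latent variable is pushed through a chain of smooth deterministic maps (standing in for ABM update rules), and a Gaussian log-likelihood of the result is maximized by gradient ascent. The minimal forward-mode autodiff class below is an illustrative stand-in for a real differentiable-programming framework; all names and functional forms are ours, not the paper's.

```python
import math

class Dual:
    """Minimal forward-mode autodiff value; a toy stand-in for the
    automatic differentiation provided by deep learning frameworks."""
    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps
    def _lift(self, o):
        return o if isinstance(o, Dual) else Dual(o)
    def __add__(self, o):
        o = self._lift(o)
        return Dual(self.val + o.val, self.eps + o.eps)
    __radd__ = __add__
    def __rsub__(self, o):
        o = self._lift(o)
        return Dual(o.val - self.val, o.eps - self.eps)
    def __mul__(self, o):
        o = self._lift(o)
        return Dual(self.val * o.val, self.val * o.eps + self.eps * o.val)
    __rmul__ = __mul__
    def tanh(self):
        t = math.tanh(self.val)
        return Dual(t, (1.0 - t * t) * self.eps)

def forward(m0):
    """A chain of smooth deterministic maps standing in for ABM updates."""
    z = Dual(m0, 1.0)            # seed the derivative dz/dm0 = 1
    for _ in range(5):           # five "time steps"
        z = (0.5 * z).tanh() + 0.1 * z
    return z

def fit(obs, m0=0.0, lr=20.0, steps=300):
    """Maximize the Gaussian log-likelihood of obs by gradient ascent on m0."""
    for _ in range(steps):
        resid = obs - forward(m0)          # carries value and derivative
        grad = (-0.5) * resid * resid      # log N(obs | z(m0), 1), up to const
        m0 += lr * grad.eps                # ascend d(loglik)/dm0
    return m0
```

Because the map is a monotone chain of differentiable steps, gradient ascent recovers the latent scalar that generated a noiseless observation.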

A Model description
Here, we give a minimal description of the learnable model. In Supplementary Section S1.2 we provide a longer description as well as a detailed interpretation of each modeling assumption.
The model represents the housing market of a city with L locations or neighborhoods denoted by x = 1, . . ., L, each with N indistinguishable homes, inhabited by agents that are only distinguished by their income class k = 1, . . ., K. The vector of state variables Zt is composed of the variables {M t,x,k }, {P t,x }, {R t,x }, where M t,x,k is the number of agents of income class k living in location x at time t; P t,x is the average price of location x at time t; and R t,x is the inventory of unsold homes at location x at time t. These state variables are updated according to deterministic and stochastic equations that represent the demand and supply sides of the housing market, and the matching between potential buyers and sellers. The causal links between these variables are summarized in Figure 5. All the equations of the model, Equations (M1) to (M13), are listed in Table 1.
Equations (M1) to (M3) characterize the number of buyers from each income class that try to buy a house at each location at time t. Buyers prefer to live in locations with higher attractiveness A t,x , which depends on a constant local intrinsic attractiveness A I x and on the time-varying average income at that location, captured through Y k , the income of agents in income class k (M1). However, locations with high attractiveness may also be more expensive, so the probability π t,x,k for a buyer of income class k to choose location x also depends on the difference between their ability to pay, here proxied by income Y k , and the price P t,x (M2). Finally, the number of potential buyers of income class k for location x at time t, N B t,x,k , is obtained by simply assuming that, for each income class k, a fraction Γ k of the total buyers Q distribute themselves among all locations following the probabilities π t,x,k (M3).
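A minimal sketch of this demand side, under assumed functional forms (in particular, the Cobb-Douglas utility combining attractiveness and affordability is our guess) and with the multinomial in (M3) replaced by its expected value:

```python
def demand_side(M_prev, P_prev, A_intr, Y, Q, Gamma, beta=0.5):
    """Sketch of (M1)-(M3) for L locations and K income classes.
    The Cobb-Douglas utility is an assumed functional form."""
    L, K = len(A_intr), len(Y)
    # (M1) attractiveness: intrinsic part plus lagged mean income at x,
    # normalized by the city-wide mean income
    Ybar = [sum(M_prev[x][k] * Y[k] for k in range(K))
            / max(sum(M_prev[x]), 1e-9) for x in range(L)]
    Ymean = sum(Ybar) / L
    A = [A_intr[x] + Ybar[x] / Ymean for x in range(L)]
    # (M2) utility trades off attractiveness and affordability;
    # an unaffordable location (Y_k <= price) has zero utility
    V = [[(A[x] ** beta) * ((Y[k] - P_prev[x]) ** (1 - beta))
          if Y[k] > P_prev[x] else 0.0 for x in range(L)]
         for k in range(K)]
    pi = [[V[k][x] / max(sum(V[k]), 1e-9) for x in range(L)]
          for k in range(K)]
    # (M3) expected k-buyers at x: mean of Multinomial(Q * Gamma_k, pi_k)
    NB = [[Q * Gamma[k] * pi[k][x] for x in range(L)] for k in range(K)]
    return pi, NB
```

With two locations and two classes, the poorer class concentrates on the only location it can afford, and each class's buyers sum to its share of Q.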
Next, Equations (M4) and (M5) characterize the supply side of the market. The number of sellers N S t,x is given by the inventory of houses on sale at the previous time step, R t−1,x , plus a fraction α of the houses that were not on sale (M4). Moreover, the minimum price that the sellers at location x are willing to accept, P S t,x , is a smooth function of the ratio between the number of buyers and sellers at x: when there are more buyers than sellers, sellers refuse to sell at a price below P t−1,x ; conversely, when there are more sellers, they are willing to sell at a discount, down to a price that is 1 − δ of P t−1,x (M5).
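A minimal sketch of (M4) and (M5); the text specifies a hyperbolic tangent interpolation, while the exact argument of the tanh is our assumption:

```python
import math

def supply_side(R_prev, P_prev, NB, N, alpha=0.1, delta=0.06):
    """Sketch of (M4)-(M5); the exact tanh argument is assumed."""
    L, K = len(P_prev), len(NB)
    # (M4) sellers: previous inventory plus a fraction alpha of the rest
    NS = [R_prev[x] + alpha * (N - R_prev[x]) for x in range(L)]
    # (M5) sellers' reservation price: equal to P_{t-1,x} when buyers
    # outnumber sellers (phi -> inf), down to (1 - delta) * P_{t-1,x}
    # when sellers far outnumber buyers (phi -> 0)
    phi = [sum(NB[k][x] for k in range(K)) / max(NS[x], 1e-9)
           for x in range(L)]
    PS = [P_prev[x] * (1.0 - delta + delta * math.tanh(phi[x]))
          for x in range(L)]
    return NS, PS
```

The two limits match the description: with many buyers the reservation price stays at the previous price, with no buyers it drops by the full discount δ.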
The demand and supply sides of the market are matched in Equations (M6) to (M11). The number of deals or transactions D t,x is the short side of the market, i.e., the minimum between the number of buyers and sellers (M6). When there are more buyers than sellers, only some buyers are able to secure a deal. The probability that agents from income class k secure a deal at x is represented by π D t,x,k , which is proportional to the number of buyers from class k and to their income (M7). The number of buyers of each income class who secure a deal, D B t,x,k , is given by the D t,x realizations of a multinomial with parameter π D t,x,k (M8). In this way, the outcome of this random variable is consistent with D t,x : the total number of buyers in each location x is fixed to D t,x , and, for each location, the buyers are distributed among income classes according to π D t,x . The number of sellers from class k who manage to sell at location x (D S t,x,k ) is instead simply proportional to the fraction of k-agents living in location x (M9). Equations (M6) to (M9) thus fully specify the matching between buyers and sellers.
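A single-location sketch of the matching and price equations; following the detailed description in the supplementary material, the weights in (M7) use the margin Y k − P S t,x , and the multinomial in (M8) is replaced by its expected value. Names and simplifications are ours:

```python
def match_market(NB, NS, PS, Y, M_prev, N, nu=0.5):
    """Sketch of (M6)-(M11) for a single location."""
    K = len(Y)
    # (M6) deals: the short side of the market
    D = min(sum(NB), NS)
    # (M7) chance that a class-k buyer closes a deal: proportional to the
    # number of k-buyers times their margin over the sellers' price
    w = [NB[k] * max(Y[k] - PS, 0.0) for k in range(K)]
    piD = [wk / max(sum(w), 1e-9) for wk in w]
    # (M8) successful buyers per class (expected value of the multinomial)
    DB = [D * piD[k] for k in range(K)]
    # (M9) successful sellers per class, pro rata on current inhabitants
    DS = [D * M_prev[k] / N for k in range(K)]
    # (M10)-(M11) transaction price: nu-weighted mean of the average buyer
    # reservation price (their income) and the sellers' reservation price
    PB = sum(DB[k] * Y[k] for k in range(K)) / max(D, 1e-9)
    P = nu * PB + (1.0 - nu) * PS
    return D, DB, DS, P
```

In the example below there are 40 buyers but only 20 sellers, so the 20 deals are split 60/40 between the two classes according to their size and margin.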

B Algorithm derivation
Here, we provide a more detailed description of our algorithm. Following our online assumption, its goal is to estimate the latent variables at time t by looking at the observables at the same time step, treating all the previously estimated variables as fixed. Specifically, D B 0 , . . ., D B t−1 , i.e., the buyers who previously relocated, and the corresponding sellers D S 0 , . . ., D S t−1 , are fixed. Therefore, the algorithm observes P̃t and D̃t, provides a new estimate of D B t , and updates the estimate of M 0 . In fact, since the previous D B and D S are fixed, Mt is a deterministic function of the latent variable M 0 , allowing us to treat M 0 , and not Mt, as our latent variable.
Since our observed variables are, in principle, also a deterministic result of the others, we model their observed value as a noisy proxy of the value determined by the agent-based model rules. Specifically, for prices we assume that we observe P̃t, given by P̃t = Pt + ε P , where the error ε P is normally distributed and Pt is the deterministic estimate of prices as computed by the model (see (M11)). Similarly, the number of observed deals D̃t follows D̃t = Dt + ε D . Now, computing the likelihood of the observed prices P̃t requires knowledge of the latent variable D B t , that is, the distribution of buyers among classes and locations, which is a discrete outcome of a stochastic process dependent on our main latent variable M 0 . Therefore, we resort to the (Generalized) Expectation Maximization algorithm: we alternate between evaluating the expectation of D B t and updating our estimate of M 0 under the current estimate of that expectation. The latter step can be performed with online gradient descent, since, once we fix the probability of each possible outcome of D B t , what remains of the likelihood is a continuous and differentiable function of M 0 .
First, observe that P̃t and D̃t are independent given M 0 . In fact, Dt is fixed given M 0 ; the distribution of D B is also fixed, since it is determined by Dt and π D ; and the error ε D is drawn independently from the extraction of D B from such distribution and from ε P . Therefore, we can factorize the complete-data likelihood w.r.t. the observed data D = {P̃t, D̃t} as P(D|M 0 ) = P(P̃t|M 0 ) P(D̃t|M 0 ) (2). Since computing P(P̃t|M 0 ) = Σ D b P(D b |M 0 ) P(P̃t|D b , M 0 ) without knowledge of the latent variable D B t would be unfeasible, we resort to the Generalized Expectation Maximization algorithm, alternating these two steps until convergence: 1. First, we evaluate the expectation of D B t given the rest of the variables. Given the set Ω of all possible values of D B t , for each D B t ∈ Ω we evaluate q(D B t ) := P(D B t |M * 0 ), where M * 0 is the current estimate of M 0 . This probability is computed from (M8).
2. Then, we update the estimate of M 0 in order to increase the likelihood from Equation (2), by increasing the auxiliary function Q(M 0 ) = Σ D B t ∈Ω q(D B t ) log P(P̃t, D̃t|D B t , M 0 ) (3). Noting that D̃t does not depend on D B t (see Figure 5), the last probability can be decomposed as log P(P̃t, D̃t|D B t , M 0 ) = log P(P̃t|D B t , M 0 ) + log P(D̃t|M 0 ) (4). These two elements are given by the Gaussian distribution of the errors ε P , ε D ∼ N (0, σ) (σ being a hyper-parameter), between P̃t and Pt, and between D̃t and Dt, respectively. Note that Pt and Dt are deterministic functions of the latent variable M 0 we are optimizing, through the chain of deterministic equations of the ABM (Table 1). The only free variable is in fact M 0 , since the variables from previous time steps are assumed to be fixed, and the value of D B is known by the EM assumption. Therefore, since all these deterministic functions are continuous and differentiable in the general case, it is easy to update M * 0 by ascending the gradient ∇ M * 0 Q(M * 0 ). The computation of this gradient is delegated to differentiable-programming frameworks.
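The alternation between the two steps can be illustrated on a toy model with one continuous parameter and one binary latent variable; a numerical gradient stands in for automatic differentiation, and all names are illustrative:

```python
import math

def gem_step(theta, y, f, pz, lr=0.1, grad_steps=4, h=1e-5):
    """One Generalized-EM step on a toy model: y ~ N(f(theta, z), 1) with
    a discrete latent z in {0, 1}; all names here are illustrative."""
    zs = (0, 1)
    # E-step: q(z) proportional to p(z | theta) * N(y | f(theta, z), 1)
    w = [pz(z, theta) * math.exp(-0.5 * (y - f(theta, z)) ** 2) for z in zs]
    q = [wi / sum(w) for wi in w]
    # M-step (generalized): a few ascent steps on the auxiliary function
    # Q(theta) = E_q[log p(y, z | theta)], gradient taken numerically
    def Q(th):
        return sum(q[i] * (math.log(pz(zs[i], th))
                           - 0.5 * (y - f(th, zs[i])) ** 2)
                   for i in range(len(zs)))
    for _ in range(grad_steps):
        theta += lr * (Q(theta + h) - Q(theta - h)) / (2.0 * h)
    return theta, q
```

Iterating the step converges to the fixed point where the parameter estimate and the posterior over the latent variable are mutually consistent.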
Nevertheless, this approach presents a problem: the set Ω of possible values for D B t , the matrix of the numbers of actual buyers for each class and each location, is potentially huge (precisely, (n+k−1 choose n) with n buyers and k classes). We solve this problem with two considerations: first, we will show that P̃t,x ⊥⊥ P̃t,y | Dt, π D t ; second, we are not interested in all the possible values for the number of actual buyers: the behavior of a single agent is irrelevant with respect to the data we observe. Let us analyze these two key points.
The first consideration stems from the independence of the outcomes in different locations: D B t,x ⊥⊥ D B t,y | Dt, π D t . This fact follows naturally from (M8), since all locations are drawn independently. As a consequence, the probabilities of the observed prices P̃t,x and P̃t,y are also independent from each other for any two locations x ≠ y, since (M10) and (M11) do not have any inter-location effect, and the observation noise ε P is also independent across locations. Therefore P̃t,x ⊥⊥ P̃t,y | Dt, π D t , and we can write P(P̃t|Dt, π D t ) = Π x P(P̃t,x|Dt,x, π D t,x) (5). Thus, we can factorize Equation (3) in a more practical way. Let us call Ωx the set of all possible values of D B t,x , given a location x. Then, our algorithm becomes an iteration of the following two steps until convergence.
1. Evaluate q(D B t,x ) := P(D B t,x |M * 0 ), ∀x ∈ {1, . . ., L} and ∀D B t,x ∈ Ωx. Note that any two values in Ωx are mutually exclusive, so Σ D B t,x ∈Ωx q(D B t,x ) = 1 holds for all x.
2. Update M * 0 by ascending the gradient ∇ M * 0 Q(M * 0 ), which factorizes across locations because of Equations (4) and (5).
To define the set Ωx we take advantage of the second key point: two different values of D B t,x might be indistinguishable in practice given our data, if they differ only by a few agents. Thus, instead of considering all the possible partitions of the integer D t,x into K non-negative integers, we only consider their quotient by a given constant s (i.e., ⌊D t,x /s⌋): this can be thought of as the set of possible outcomes obtained by moving groups of s agents at a time. Any difference below s is considered negligible. In practice, we set s as a consequence of the available memory. Given a maximum size for |Ω|, we set a threshold for each |Ωx| proportional to its original size (n+k−1 choose n). Here, we take into account the effective number of classes k ≤ K that can afford a location, since both P t−1 and Y are assumed to be known at time t. After setting this threshold, we find the minimum s such that |Ωx| respects the threshold.
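The effect of this coarsening can be illustrated with a small enumeration helper; assigning the remainder to the last class is our choice for illustration:

```python
from math import comb

def omega_size(n, k):
    """Number of ways to split n identical buyers among k classes."""
    return comb(n + k - 1, n)

def coarse_outcomes(D, k, s):
    """Splits of D buyers into k classes in multiples of s, with any
    remainder assigned to the last class; an illustrative reading of
    the coarsening described above."""
    g = D // s                              # number of groups of s agents
    def splits(total, parts):
        if parts == 1:
            yield (total,)
            return
        for i in range(total + 1):
            for rest in splits(total - i, parts - 1):
                yield (i,) + rest
    out = []
    for grp in splits(g, k):
        split = [c * s for c in grp]
        split[-1] += D - s * g              # keep the number of deals exact
        out.append(tuple(split))
    return out
```

For instance, 10 deals split among 2 classes admit 11 exact outcomes, but only 3 coarse ones when moving groups of s = 5 agents at a time.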
S1.1 Description of the original ABM

Pangallo et al. [13] introduce an ABM that describes the housing market of a city. It is beyond the scope of this section to fully repeat the description of the model and to justify each assumption, so we give a brief overview and report the pseudocode of the model (see Algorithm S1 and Algorithm S2).
The city has N · L inhabitants (see Table S2 for a summary of the notation used), with Q buyers coming to the city at every time step to purchase a home, of which γ k belong to income class k, k = 1, . . ., K, such that Q = Σ k γ k . Each inhabitant i is characterized by four state variables that can change over time t: state s t,i (buyer, housed, seller), reservation price P R t,i , location x t,i , and a categorical income Y t,i that belongs to one of the K income classes. The city is composed of L neighborhoods or locations x that are distinguished by their intrinsic attractiveness A I x , social attractiveness A S t,x (which depends on which agents inhabit the neighborhood), and market price P t,x .
The model is initialized following some protocol to locate agents with different incomes in the city (e.g., uniformly at random, or following some predefined spatial distribution). After that, at each time step t some buyers come to the city to purchase a home, some housed agents decide to leave and sell their home, and buyers and sellers at each location are matched via a continuous double auction.
Algorithm S1 details the operations that occur at each time step. First, the model updates some location-specific variables which reflect the change in social composition that occurred in the previous time step (lines 1 to 9). These include updating the average income and social attractiveness at each location and then computing the utility for agents of a given income class at a given location. Next, buyers choose a location where they try to purchase a home (lines 10 to 22). After that, housed agents may put their home on sale with probability α and set a reservation price by applying a markup µ to the market price of the location where they live (lines 23 to 29). Agents that decided to sell their home in previous time steps but were unsuccessful reduce their price by a factor λ every τ time steps (lines 30 to 34). Finally, buyers and sellers are matched at each location via a continuous double auction, and successful buyers replace successful sellers (lines 35 to 37).
Algorithm S2 details the continuous double auction process. At each location, there is a set of buyers B t,x and a set of sellers S t,x (lines 1 to 2). If there are either no buyers, or no sellers, or all reservation prices of the buyers are lower than the reservation prices of the sellers, no transaction takes place, and the market price does not get updated (lines 3 to 4). If instead at least one transaction can occur (lines 5 to 23), the following process takes place. First, one creates auxiliary lists of buyers and sellers (also known as logs), O B t,x and O S t,x respectively, and fills them as agents are drawn uniformly at random from the common pool of buyers and sellers. Every time a buyer with a higher reservation price than a seller enters O B t,x (or a seller with a lower reservation price than a buyer enters O S t,x ), the buyer with the highest reservation price is matched with the seller with the lowest reservation price. The price of the transaction is the weighted mean of the respective reservation prices, with a weight given by a parameter ν capturing bargaining power, and this individual transaction price is added to the list P t,x . Finally, the seller leaves the city and the buyer settles where the seller was. The market price is computed as the mean of the transaction prices.
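The auction can be sketched as follows, in our reading of Algorithm S2 (the matching rule is as described above; arrival order and tie handling are assumptions):

```python
import random

def double_auction(buyer_prices, seller_prices, nu=0.5, seed=0):
    """Sketch of the continuous double auction of Algorithm S2: agents
    arrive in random order and crossing orders are matched immediately."""
    rng = random.Random(seed)
    pool = ([("B", p) for p in buyer_prices]
            + [("S", p) for p in seller_prices])
    rng.shuffle(pool)
    book_B, book_S, prices = [], [], []
    for side, p in pool:
        (book_B if side == "B" else book_S).append(p)
        # match while the best buyer's reservation crosses the best seller's
        while book_B and book_S and max(book_B) >= min(book_S):
            pb, ps = max(book_B), min(book_S)
            book_B.remove(pb)
            book_S.remove(ps)
            # nu is the sellers' bargaining power
            prices.append(nu * pb + (1.0 - nu) * ps)
    return prices  # the market price is the mean of these
```

Note that, unlike the learnable ABM's matching equations, the outcome here can depend on the random arrival order of the agents.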

S1.2 Detailed description of the learnable ABM
This section gives a more detailed description of the learnable ABM than Section A in Materials & Methods, and details the interpretation of each equation of the model. Table S3 can be used as a reference for notation throughout this section (although some notation overlaps with that of Table S2, there are a few differences, so we prefer to present the two tables separately).

S1.2.1 General set-up
Agents are divided into K income classes, each characterized by income Y k , k = 1, . . ., K. All agents within the same income class, also named k-agents, are assumed to be identical and indistinguishable.The city is composed of L locations denoted by x.

S1.2.2 Demand
Let M t,x,k be the number of inhabitants of class k living at location x at time t. As shown in Table S3, this number is real rather than integer. We make this choice for computational reasons but, as we typically deal with large values of M , it does not substantially affect our results. We assume that each location x is characterized by an attractiveness A t,x that can change over time, given by the sum of two terms. The first term, A I x , is an intrinsic attractiveness that is fixed over the simulation. It captures relatively permanent city features, such as amenities, schools, and public transport. The second term captures an attractiveness towards wealthier neighborhoods that can vary in time. It is defined through the mean, one-period-lagged income at the location, Y t−1,x = Σ k M t−1,x,k Y k / Σ k M t−1,x,k . Thus, a location x whose mean income is higher than average, i.e., Y t−1,x > Y t−1 , is, ceteris paribus, more attractive than locations whose mean income is lower than average.
In their decision to move to location x, agents in income class k also take into account the affordability of location x, i.e., the difference between their willingness to pay, here simply captured by their income Y k , and the average price at t − 1, P t−1,x . The utility of k-agents for location x then combines attractiveness and affordability, with β ∈ (0, 1) giving the relative weight of attractiveness relative to affordability. When β is close to 1, agents care little about affordability, while when β is close to 0 the opposite holds. If location x is unaffordable (Y k ≤ P t−1,x ) then V t,x,k = 0. Buyers in income class k are willing to bid up to their income Y k , i.e., their reservation price P B t,x,k is equal to Y k . Summing up, k-agents looking to buy a house in the city evaluate a utility V t,x,k for all locations x. They then choose a location x where they try to buy a house with probability π t,x,k proportional to V t,x,k , i.e., π t,x,k = V t,x,k / Σ x′ V t,x′,k .
Table S3: Notation for the learnable ABM.

Table S3 reports, for each symbol, the set it belongs to and its meaning; its entries include, among others, P S t,x (the reservation price for sellers), Dt (the number of transactions, or deals, that actually take place), π D t,x,k (the probability that an agent is selected among the buyers to conclude a deal), and D S t,x,k (the number of potential sellers that complete a transaction).
We first indicate the parameters K and L that determine the size of the variables, next we indicate model-wide parameters (i.e., scalar quantities that are fixed in time) and locationor income class-specific parameters (size L or size K), and finally variables that can be location-specific (size L) or location-income-specific (size L × K).We further show which quantities are constrained to the unit interval [0, 1].
We assume that a total of Q agents come to the city at each time step looking to buy a house, and that a share Γ k of these agents is in income class k. The number of buyers of income class k at location x at time t is then N B t,x,k = QΓ k π t,x,k , i.e., the expected value of a multinomial with QΓ k trials and a probability vector given by the L values of π t,x,k over locations x ((M3)).
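The difference between sampling each buyer's choice (original ABM) and taking the expected value (learnable ABM, Equation (M3)) can be sketched as follows; function names are ours:

```python
import random

def expected_buyers(Q, Gk, pi):
    """(M3): the expected value of Multinomial(Q * Gk, pi), with no
    sampling noise; this is what the learnable ABM uses."""
    return [Q * Gk * p for p in pi]

def sampled_buyers(Q, Gk, pi, rng):
    """What the original ABM does: each buyer independently draws one
    location from the categorical distribution pi."""
    counts = [0] * len(pi)
    for _ in range(int(Q * Gk)):
        r, acc = rng.random(), 0.0
        for x, p in enumerate(pi):
            acc += p
            if r < acc:
                counts[x] += 1
                break
    return counts
```

Averaged over many simulated runs, the sampled counts concentrate around the expected value used by the learnable ABM.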

S1.2.3 Supply
In each location x, the other side of the market is composed of sellers. We do not distinguish the income class of potential sellers, in the sense that we just keep track of the total number of agents willing to sell their house at location x at time t, N S t,x . The total number of sellers is obtained by summing the number of agents who wanted to sell at the previous time steps but did not succeed, R t−1,x , and the number of agents who decide to put their house on sale at t. In turn, this number is given by a fixed fraction α of the agents that had not decided to sell before t, i.e., the difference between the total number of agents residing at location x, N x , and R t−1,x . In formulas, N S t,x = R t−1,x + α(N x − R t−1,x ) ((M4)). The way sellers determine their reservation price, i.e., the minimum price they are willing to accept, is more sophisticated. Here, we assume that sellers are not willing to accept any price below the average price at the previous time step, P t−1,x , as long as there are more buyers than sellers. This choice captures the idea that, in this situation (known as a "sellers' market"), sellers have more bargaining power than buyers. Conversely, when there are more sellers than buyers, sellers compete for the few buyers by being aggressive in reducing their reservation price, so they are willing to accept offers below P t−1,x . We let φ t,x denote the ratio between the number of potential buyers and potential sellers at location x and time t. We then assume that sellers are willing to accept prices lower than P t−1,x by up to a fraction δ if there are more sellers than buyers, i.e., φ t,x → 0. Conversely, when there are more buyers than sellers (φ t,x → ∞), sellers are not willing to go below P t−1,x . We interpolate between these extreme values of φ by assuming a hyperbolic tangent functional form for the sellers' reservation price P S t,x ((M5)). Here, we assume that sellers decide on their reservation price independently of their income class.

S1.2.4 Matching
At this point, the two sides of the market have been completely characterized, as we know the number of buyers in each income class N B t,x,k , their reservation price P B t,x,k , the number of sellers N S t,x , and their reservation price P S t,x . It remains to be determined how buyers and sellers are matched, and how this matching impacts future prices and the social composition of neighborhoods.
To start, let D t,x denote the number of deals that occur at location x and time t, i.e., the number of transactions that effectively occur between buyers and sellers. This number is given by the "short side of the market", i.e., by the minimum between the number of potential buyers and potential sellers: D t,x = min( Σ k N B t,x,k , N S t,x ) ((M6)). In case there are fewer deals than potential buyers, i.e., D t,x < Σ k N B t,x,k , we need to decide which potential buyers are successful in actually buying a house and which are not.
To do so, we assume that demand is satisfied on a pro-rata basis, although we correct the pro-rata assumption by making richer buyers more likely to secure a deal. This assumption captures a bargaining process in a more tractable way than explicitly simulating an auction. Thus, the probability that k-agents are able to secure a deal at location x and time t, denoted as π D t,x,k , is proportional to the number of potential buyers in that class, N B t,x,k , multiplied by the difference between the reservation price of k-buyers and that of sellers, Y k − P S t,x ((M7)). Then, we compute the number of actual buyers of class k at time t in location x, D B t,x,k , as a multinomial with D t,x trials and a parameter vector of length K given by the π D t,x,k ((M8)). We further compute the number of actual sellers D S t,x,k by assuming that all agents living in location x are equally likely to sell, so that the share of k-agents among the sellers is proportional to the share of k-agents among the inhabitants: D S t,x,k = D t,x M t−1,x,k /N ((M9)). As above, this simplification ensures tractability, as keeping track of the number of sellers in each class over time would substantially increase the dimensionality of the space of state variables.
Next, after determining which buyers and which sellers are successful in securing a deal, we need to determine the price of the transactions. First, we compute the average buyer reservation price P B t,x as the D B t,x,k -weighted average of the buyers' reservation prices Y k ((M10)). Then, we assume that the average transaction price is a weighted average between the reservation price of buyers and that of sellers: P t,x = ν P B t,x + (1 − ν) P S t,x ((M11)). Here, ν denotes the bargaining power of sellers: the larger ν, the higher the transaction price.

S1.2.5 Update of state variables
It only remains to update the stocks of inhabitants M t,x,k of each class k in each location x and of unsuccessful sellers R t,x in x. For each class and each location, the number of inhabitants is given by the number of inhabitants at the previous time step, plus the buyers who secured a deal, minus the sellers who were able to sell: M t,x,k = M t−1,x,k + D B t,x,k − D S t,x,k ((M12)). The number of unsuccessful sellers is obtained by summing the unsuccessful sellers at the previous time step and the new sellers at t, and subtracting the number of deals: R t,x = N S t,x − D t,x ((M13)).
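The two updates can be sketched for a single location as follows; reading the unsold inventory (M13) as all current sellers minus the deals is our interpretation:

```python
def update_state(M_prev, NS, D, DB, DS):
    """Sketch of (M12)-(M13) for one location; names are ours."""
    K = len(M_prev)
    # (M12) inhabitants: previous stock plus successful buyers
    # minus successful sellers, class by class
    M = [M_prev[k] + DB[k] - DS[k] for k in range(K)]
    # (M13) unsold inventory: all sellers at t minus the deals
    R = NS - D
    return M, R
```

Because every successful buyer replaces a successful seller, the total population at the location is conserved.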

S1.3 Comparison between the original ABM and the learnable ABM
We need to perform several modifications in order to make the original ABM learnable (see Table S4 for a summary). As a general principle, while in the original ABM the state of the system was described by the variables of individual agents i, in the learnable ABM we only consider counts of how many agents are within each income class. For instance, in the original ABM we keep track of the state s t,i , income Y t,i and location x t,i of each agent i, while in the learnable ABM the variable M t,x,k counts how many agents of income k are either housed or sellers at location x, the variable R t,x counts the sellers that were not successful in selling, and so on. This general modification does not lose much information. Indeed, the key heterogeneity that distinguishes agents in the original ABM is which income class they belong to, which is kept in the learnable ABM. We discuss below some examples where considering agent counts makes (or does not make) a difference. The rationale for this general principle is that we assume that we only observe aggregate data at the level of locations, and so we must be parsimonious with unobserved variables.
We now discuss the modifications one by one. First note that the probability for buyers to search for a home in a given location does not change from the original to the learnable ABM. For instance, although the specifications look different, Equation (M1) and lines 1 to 9 in Algorithm S1 are identical. At the same time, Equation (M2) is just a shorthand for lines 10 to 14.
When it comes to choosing a specific location, the learnable ABM essentially assumes the expected value of the stochastic process used in the original ABM. In the original ABM, individual agents belonging to a given income class k select a given location x by drawing from a categorical distribution, i.e., a multinomial distribution with one trial (line 20). Because all buyers belonging to the same income class are identical and have the same probability to choose a given location, it is completely equivalent to consider a multinomial distribution with γ k trials. Indeed, this is what the learnable ABM does, except that it considers the expected value of this distribution (Equation (M3)). This choice is, once again, made to limit the amount of stochasticity: as the model does not observe potential buyers, it would have to estimate the realization of this variable, and this may create computational problems.

Our procedure involves several hyperparameters, so we only explain how we assign some hyperparameter values, and explicitly show the effect of two hyperparameters that we consider particularly important.
• Initial guess for M 0 . The first expectation step of our expectation-maximization algorithm requires an initial guess for M 0 . We take a uniform M 0 (i.e., a situation in which there is the same number of inhabitants of each income class in each location), to which we add some random noise. The realization of this noise is the only difference between runs of the learning algorithm on the same trace (all the other steps are deterministic). We experimented with a few options for the variance of the noise, finding that it did not have a substantial impact on the results, and so simply decided to draw the noise from a standard Gaussian pdf in logarithmic space.
• Expectation-Maximization (EM) parameters. The EM algorithm keeps iterating the expectation and maximization steps until some variables converge, or up to a maximum number of steps. We use a 5% threshold to decide on convergence, and allow for a maximum of 100 steps. Moreover, in the maximization step the gradient descent algorithm requires a learning rate and a number of learning steps (within each EM cycle). We choose a learning rate of 0.001 and a maximum of 4 learning steps. Overall, in our explorations the results were not particularly sensitive to the EM parameters.
• Number of epochs. In our preliminary experiments we found that going beyond 3 epochs only marginally increased accuracy, and the marginal gain kept decreasing with the number of epochs. In light of this preliminary evidence, we fixed the number of epochs to 5.
• Parameter δ. The parameter δ in Equation (M5) has no counterpart in the original ABM, in which the reservation price was set by individual sellers following an aspiration-level heuristic. Thus, one needs to choose a value of δ that makes the time series generated by the learnable ABM as similar as possible to the time series generated by the original ABM. We experimented with a few values, finding that δ = 0.06 yielded the most similar time series.
This leaves two hyperparameters whose effect we want to explore.
• Number of D B samples. As detailed in Section B of Materials & Methods, our EM algorithm only considers a subset of all possible values of the latent variable whose likelihood is estimated in the expectation step. If the number of samples is higher, we expect the performance of the algorithm to improve, at increased computational cost. We expect that increasing the number of D B samples has a similar effect as increasing the number of epochs. However, because the sampling from all possible values of D B is an original contribution of this paper, it is interesting to explicitly explore how it affects the performance of the algorithm.
• Standard deviation of the noise on observables, σ P and σ D for P and D respectively. As detailed in Section B, P t and D t are deterministic functions of their ancestors, and so in order to obtain a likelihood we model their observed values as noisy proxies of the deterministic values. How close the observed values must be to the deterministic values is governed by a variance σ: the higher σ, the farther observed values can be from model values. First, note that the scale of σ does not matter: if both σ P and σ D are multiplied by the same number, the maximizer of the log-likelihood is unchanged. What matters is the relative value of σ P with respect to σ D . When σ P is larger than σ D , the learning algorithm gives more importance to D t , while in the opposite case it gives more importance to P t . To study this effect, we set σ P = 1 without loss of generality, and vary σ D .
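Both points, that rescaling σ P and σ D together leaves the estimate unchanged, and that their ratio decides which observable dominates, can be checked on a toy likelihood where prices and deals pull the latent variable toward different values; all numbers are illustrative:

```python
def neg_loss(m, sigP, sigD, obsP=1.0, obsD=4.0):
    """Toy log-likelihood (up to constants) with two observables pulling
    the latent m toward different values: prices fit at m = 1, deals at m = 2."""
    fP, fD = m, 2.0 * m                  # deterministic model outputs
    return (-0.5 * ((obsP - fP) / sigP) ** 2
            - 0.5 * ((obsD - fD) / sigD) ** 2)

def argmax_m(sigP, sigD):
    """Grid-search maximizer of the toy log-likelihood."""
    grid = [i / 100.0 for i in range(0, 301)]
    return max(grid, key=lambda m: neg_loss(m, sigP, sigD))
```

Rescaling both sigmas multiplies the log-likelihood by a constant, so the maximizer does not move; changing their ratio moves it between the price-implied and the deal-implied value.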
Figure S6 shows the Pearson correlation coefficient ρ(M, M̂) between the ground-truth M and the estimate M̂ in the 10 traces that we use as a training set, for different choices of hyperparameters. We consider three values of the number of D B samples, namely 16, 64, and 256, and three values of σ D , namely 0.01, 1, and 100.

To explore the loss landscape, we vary the number of inhabitants with the highest income (M 0,x,k=2 ) at locations x = 0 and x = 1, ensuring that the total number of inhabitants at these locations remains equal to N . To do so, we fix the number of middle-income inhabitants (M 0,x,k=1 ) to their ground-truth value, and fix the number of low-income inhabitants (M 0,x,k=0 ) to M 0,x,k=0 = N − M 0,x,k=1 − M 0,x,k=2 . These constraints make it possible to vary M 0,x=0,k=2 between 0 and 999 and M 0,x=1,k=2 between 0 and 992. These values of M 0 include the ground truth, which corresponds to M 0,x=0,k=2 = 0 and M 0,x=1,k=2 = 815. (All other components of M 0 are the same as in the ground truth.) Figure S8 shows the loss as a function of these values of M 0 , both taking the learnable ABM and the original ABM as ground truth (for visualization purposes, we only show values of M 0,x=0,k=2 between 0 and 300; all other values lead to a much larger loss). In the case of the learnable ABM (shown on the left) there is no mis-specification, and the loss attains its minimum possible value. Since we take a Gaussian with unit variance and zero mean to model the error on P t and D t , the minimum value of L is 2, attained when the errors on P t and D t are zero. As we see from the heatmap, the loss is well-behaved, in the sense that it does not display local minima and the minimum corresponds to the ground truth.
Interestingly, the gradient is much stronger when varying the number of high-income inhabitants at location 0, M_{0,x=0,k=2}, than when varying the number of high-income inhabitants at location 1, M_{0,x=1,k=2}. This effect can be explained by the initial price (not shown) at location 1 being much higher than the initial price at location 0, so that only the highest-income agents can afford location 1 in the first place. Therefore, making location 1 more or less attractive by increasing or decreasing the number of high-income agents inhabiting it does not make much of a difference to the distribution of buyers, compared to changing the number of high-income agents at location 0, which the entire population can afford.
The right panel of Figure S8 takes the original ABM as the ground truth. The heatmap is not much different, suggesting that there is no major mis-specification. However, the global minimum is a corner solution, with 1000 high-income inhabitants at location 1 and 0 high-income inhabitants at location 0, in contrast with the ground truth (which is the same for the original and the learnable ABM). Thus, while correctly guessing that location 1 has a much higher number of high-income inhabitants than location 0, the algorithm would not yield a perfect estimate. This qualitative result is in line with the quantitative results that we show in Figure 2 of the main paper.

Figure S9: Log-likelihood of prices and transactions at all five locations as we vary the number of high-income inhabitants at location 0. Here, the learnable ABM is the ground truth, and the ground-truth number of high-income inhabitants is represented as a vertical grey line.
Next, Figure S9 shows the log-likelihood of prices log P(P̂_t | M_0) and of transactions log P(D̂_t | M_0) as a function of the number of high-income inhabitants at location 0, M_{0,x=0,k=2}, fixing M_{0,x=1,k=2} to its ground-truth value (the learnable ABM is the ground truth). Taking the negative sum of all the components of the log-likelihood yields the loss, so this corresponds to a horizontal cut through the heatmap in Figure S8 (left) at the vertical coordinate of the ground truth. The advantage of the representation in Figure S9 is that we can see how each component contributes to the loss, and whether there are non-linearities that give insight into difficulties in estimating the latent variables.
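As a sketch of this decomposition, with unit-variance Gaussian errors as in the text and purely illustrative error values, the loss is the negative sum of one log-likelihood component per location and per observable:

```python
import numpy as np

rng = np.random.default_rng(1)
L_LOCATIONS = 5
p_err = rng.normal(size=L_LOCATIONS)        # illustrative price errors per location
d_err = 3.0 * rng.normal(size=L_LOCATIONS)  # illustrative (larger) transaction errors

def gaussian_loglik(err, sigma=1.0):
    """Log-likelihood of a zero-mean Gaussian error with standard deviation sigma."""
    return -0.5 * (err / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# one component per location and per observable; the loss is their negative sum
components = np.concatenate([gaussian_loglik(p_err), gaussian_loglik(d_err)])
loss = -components.sum()

assert components.shape == (2 * L_LOCATIONS,)   # 5 locations x 2 observables
```

Inspecting `components` individually is what Figure S9 does graphically: it reveals which locations and which observable dominate the loss.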
First, the log-likelihood varies much more with D_t than with P_t. As an intuitive justification, consider the graphical model in Figure 5. In that representation, P_t is influenced by M_0 only through several intermediate steps, while D_t is influenced more directly. In particular, changing the attractiveness of location 0 relative to the others changes demand across locations by a large margin, leading to very different values of D_t at each location. However, to propagate these differences to prices (at the same time step!) we go through P^S_t, which only varies by a factor δ = 0.0625 from the previous price, and so is not as sensitive to changes in M_0.
Second, the maximum of the log-likelihood is correctly achieved at M_{0,x=0,k=2} = 0, and all components of the log-likelihood then decrease monotonically with M_{0,x=0,k=2}. Focusing on the transactions (which dominate the loss), we see that the component of the likelihood corresponding to location x = 0 is the most affected; the likelihood at locations 2 and 4 is also strongly affected, while the likelihood at locations 1 and 3 is barely affected. The reason is that buyers at locations 1 and 3 are almost exclusively high-income, whereas locations 2 and 4 have many middle-income buyers (as location 0 does). So, increasing the number of high-income buyers at location 0 strongly decreases the number of middle-income buyers at locations 2 and 4 (in a sense, these three locations compete), but it does not have a strong effect on locations 1 and 3.
Third, it is interesting that the log-likelihood of transactions at location 0 flattens out after M_{0,x=0,k=2} = 350 high-income inhabitants. This effect is due to a supply constraint: increasing the number of high-income inhabitants at location 0 substantially increases demand, but the number of sellers is fixed, so, when the number of buyers becomes larger than the number of sellers, the number of transactions remains fixed (Equation (M6)). Increasing the number of buyers still puts upward pressure on prices, and indeed the log-likelihood of prices does not flatten out.

As a last example, we consider a simplified setting with L = 2 locations, which allows us to visualize a larger portion of the latent variable space than with L = 5 locations. Indeed, with L = 5 locations we have 10 degrees of freedom in M_0 (considering K = 3 and the constraint that the total number of inhabitants is N at each location), but we can only visualize how the loss changes by varying 2 components and keeping all others fixed. Instead, with L = 2 locations we only have 4 degrees of freedom, so varying 2 components of M_0 at a time gives a more complete picture.
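The supply constraint behind the flattening discussed in the third point above can be sketched numerically. We take transactions to be capped at the fixed number of sellers, as implied by Equation (M6); the specific numbers here are illustrative, not the model's:

```python
import numpy as np

n_sellers = 350                    # fixed supply at the location (illustrative)
buyers = np.arange(0, 1001, 50)    # demand growing with high-income inhabitants

# transactions are capped by supply: once buyers outnumber sellers,
# further demand no longer changes the number of transactions
transactions = np.minimum(buyers, n_sellers)

assert np.all(np.diff(transactions) >= 0)   # transactions never decrease with demand
assert transactions[-1] == n_sellers        # ...and flatten at the supply constraint
```

This is exactly why the transaction log-likelihood becomes insensitive to M_{0,x=0,k=2} beyond the threshold, while the price log-likelihood keeps responding.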
For simplicity, we initialize the model with the same parameters as above, focusing on locations 2 and 4 above. In this case, the ground truth is M_{x=0,k,t=0} = [451, 549, 0] and M_{x=1,k,t=0} = [614, 386, 0]. This time, we vary the number of low-income inhabitants M_{0,x,k=0} at both locations, again using the same method as above to ensure that the total number of inhabitants is always N at each location.
The results of this analysis are shown in Figure S10. Differently from Figure S8, here the minimum of the loss is not attained at a single combination of values of M_0. Instead, all points on a line that crosses the latent variable space from (M_{x=0,k=0,t=0}, M_{x=1,k=0,t=0}) = (0, 250) to (900, 1000) lead to a value very close to the minimum loss (with only two locations, this value is L = −log((2π)^{−2} exp(0)) = 2 log(2π) ≈ 3.7). Intuitively, with just two locations, what matters is the relative attractiveness of one location compared to the other. So, as long as there are fewer low-income inhabitants at one location than at the other, several possible configurations of M_0 lead to very similar values of the loss.

This situation constitutes an identification problem: the model is not able to identify the ground truth, and any inference algorithm could converge to any value on the white line of Figure S10. We conjecture that similar issues could prevent the learning algorithm from obtaining a perfect estimate of M. Note that this problem is not due to the translation into a learnable form, but is intrinsic to the ABM under scrutiny: many possible configurations of agents could lead to the same observable outcome. Our approach allows us to formally define and diagnose such issues, thus allowing ABM researchers to take into account the learnability of their model from observed data.

Figure 3: Estimates for M_{t,x,k}, D^B_{t,x,k}, P̂_{t,x}, D̂_{t,x} compared to the traces generated with the original ABM, in a single experiment, chosen as the median experiment in terms of estimation quality.

Figure S7: Estimate of M_{0,x,k} over time and over multiple epochs, for the same simulation as the one shown in Figure 3. Each panel represents a value of location x and income class k. Each horizontal blue line is the ground-truth value of M_0; each red line is the estimate of M_0 at a particular time step in a particular epoch. Black vertical lines separate the 5 epochs on which we run our experiments; within each epoch we estimate M_0 at each of 19 time steps.

Figure S10: Heatmap of the loss (i.e., negative log-likelihood) as a function of the latent variables M_{t=0,x=0,k=0} and M_{t=0,x=1,k=0}, representing the number of low-income inhabitants at the two locations of this model. The green star represents the ground-truth value for these two variables.
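The identification problem above can be mimicked in a toy two-location model (separate from the ABM itself), under the assumption that observables depend only on the locations' relative attractiveness. Two different latent configurations with the same attractiveness gap then produce identical observables, hence identical loss; the `demand_shares` function and the sensitivity β = 0.003 are hypothetical:

```python
import numpy as np

def demand_shares(m_low, beta=0.003):
    """Toy two-location model: attractiveness decreases with the number of
    low-income inhabitants, and buyers split across locations by a softmax,
    which depends only on the DIFFERENCE in attractiveness."""
    attractiveness = -beta * np.asarray(m_low, dtype=float)
    e = np.exp(attractiveness - attractiveness.max())
    return e / e.sum()

# two distinct latent configurations with the same gap m_low[1] - m_low[0] = 250
shares_a = demand_shares([0, 250])
shares_b = demand_shares([650, 900])
assert np.allclose(shares_a, shares_b)  # identical observables -> identical loss
```

Any likelihood built on such observables is constant along configurations with the same gap, which is the kind of flat direction the white line of Figure S10 traces in the real model.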
P̂_{t,x}. The model assumes that this is a weighted mean (M11) of the maximum price that the average buyer is willing to pay, P^B_{t,x} (M10), and of the minimum price that sellers are willing to accept, P^S_{t,x} (M5). As a last step, we update the remaining state variables in Equations (M12) and (M13), simply by tracking the number of buyers and sellers in each class and location.
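A sketch of this weighted-mean price update, in the spirit of (M11); the weight `w` is a placeholder, not the paper's actual weighting:

```python
def price_update(p_buyer_max, p_seller_min, w=0.5):
    """New price as a weighted mean of the buyers' maximum and the sellers'
    minimum reservation price; w is a hypothetical placeholder weight."""
    return w * p_buyer_max + (1.0 - w) * p_seller_min

# the updated price always lies within the bid-ask spread
p_hat = price_update(p_buyer_max=120.0, p_seller_min=100.0)
assert 100.0 <= p_hat <= 120.0
```

For any weight in [0, 1], the update keeps the new price between the sellers' minimum and the buyers' maximum, which is what makes the transition differentiable and hence usable in the gradient-based learning protocol.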

Table S2: Notation used by Pangallo et al. [13]. The first block contains indexes, the second block global parameters, the third block agent attributes, the fourth block location attributes.

P^R_{t,i}: Buyer reservation price of agent i if s_{t,i} = 0, seller reservation price if s_{t,i} = 2
…_k: Number of incoming agents of class k at any time step
s_{t,i}: State of agent i. s_{t,i} = 0: Buyer. s_{t,i} = 1: Housed. s_{t,i} = 2: Seller
x_{t,i}: Location where agent i searches if s_{t,i} = 0, otherwise location where it lives if s_{t,i} = 1, 2

Algorithm S2: Continuous Double Auction(ν, {s_{t,i}, P^R_{t,i}, Y_{t,i}}_{i s.t. x_{t,i} = x}, P_{t−1,x})
Input: parameter ν, agent variables {s_{t,i}, P^R_{t,i}, Y_{t,i}}_{i s.t. x_{t,i} = x}, location variable P_{t−1,x}
Output: agent variables {s_{t,i}, P^R_{t,i}, Y_{t,i}}_{i s.t. x_{t,i} = x}, location variable P̂_{t,x}
1: B_{t,x} = {i s.t. s_{t,i} = 0} ▷ Buyers
2: S_{t,x} = {i s.t. s_{t,i} = 2} ▷ Sellers
3: if B_{t,x} = ∅ or S_{t,x} = ∅ or max_{i ∈ B_{t,x}} P^R_{t,i} < min_{i ∈ S_{t,x}} P^R_{t,i}