Inferring origin-destination distribution of agent transfer in a complex network using deep gated recurrent units

Predicting the origin-destination (OD) probability distribution of agent transfer is an important problem for managing complex systems. However, the prediction accuracy of associated statistical estimators suffers from underdetermination. While specific techniques have been proposed to overcome this deficiency, a general approach is still lacking. Here, we propose a deep neural network framework with gated recurrent units (DNNGRU) to address this gap. Our DNNGRU is network-free, as it is trained by supervised learning with time-series data on the volume of agents passing through edges. We use it to investigate how network topologies affect OD prediction accuracy, where performance enhancement is observed to depend on the degree of overlap between paths taken by different ODs. By comparing against methods that give exact results, we demonstrate the near-optimal performance of our DNNGRU, which we found to consistently outperform existing methods and alternative neural network architectures, under diverse data generation scenarios.

Deciphering the origin-destination (OD) pair has been at the heart of various protocols that aim to evaluate the traffic demand and flow within complex systems. Interest in OD pairs stems from the basic information they encode on the distribution of people, materials, or diseases, which has direct bearing on the socioeconomic phenomena of human mobility, resource allocation, and epidemic spreading. An intrinsic utility in knowing the OD distribution is that paths with higher transfer rates can be enhanced to improve a system's efficiency, or blocked to impede the transfer of malicious/undesirable entities. By far the most intensively studied OD problem for a complex system is the estimation of OD traffic of an Internet network from measurable traffic at router interfaces [1]. Collecting link traffic statistics at routers within a network is often much simpler than directly measuring OD traffic. The collected statistics provide key inputs to any routing algorithm, via link weights of the open shortest path first (OSPF) routing protocol. Similar techniques were subsequently adapted to the problem of identifying transportation OD by measuring the number of vehicles on roads [2][3][4][5]. Such inference of vehicular OD has been used by Dey et al. [5] to give better estimates of commuters' travel time, and by Saberi et al. [6] to understand the underlying dynamical processes in travel demand, which evolve according to interactions and activities occurring within cities.
Several techniques have been developed in an effort to estimate OD information from link/edge counts. These include expectation-maximisation [7], entropy maximisation (or information minimisation) [8][9][10], Bayesian inference [3,[11][12][13], quasi-dynamic estimations [14,15], and the gravity model [16][17][18]. As the number of OD degrees of freedom (quadratically proportional to the number of nodes) generally outnumbers the number of link counts (linearly proportional to the number of nodes), the major issue concerning OD estimation is that the problem is severely underdetermined. Vardi [7] attempted to resolve this issue by treating the measurements of OD intensities as Poisson random variables, and used expectation-maximisation with moments to find the most likely OD intensities giving rise to the observed link counts. The non-unique solutions due to underdetermination have also been addressed through the principle of entropy maximisation, as well as by Bayesian inference through the assumption of prior OD matrices. Alternatively, ambiguities in OD inference were treated by the quasi-dynamic method using prior knowledge of historical trip data, or through parameter calibration of the gravity model based on zonal data. Invariably, these methods lead to large uncertainties and unreliable OD inferences.
In this work, we introduce a new OD inference approach using a deep neural network (DNN), together with a method based on linear regression (LR). We regress for the probabilities ζ ij of an agent going from origin i to destination j, instead of predicting the actual number of agents in the OD matrix. These quantities ζ ij (also known as "fan-outs" [19]) are assumed to be stationary throughout the period of interest, and are thus always the same. For example, in a bus system, commuters in the morning have some preferred ζ ij , so the fan-outs are constant. However, the number of people arriving at the bus stops or boarding the buses need not be the same each time; this happens, for instance, when the next bus arrives relatively quickly after the previous bus has left, with fewer people boarding it. Consequently, simple linear regression can be implemented to obtain ζ ij from repeated measurements over the period. The actual OD numbers are just the total numbers from the origins (which are easily measurable) multiplied by ζ ij . This approach overcomes the underdetermination weakness of previous works, where repeated measurements would correspond to different OD intensities and thus cannot be combined. As a result, improvement in prediction accuracy is attained over earlier approaches such as expectation-maximisation and Bayesian inference, which we will demonstrate on a common network in Section I D.
Next, we go beyond linear regression by training a DNN with supervised learning to predict the OD probabilities ζ ij . Our approach is thus applicable to any arbitrary network that connects a set of origin nodes to a set of destination nodes. In other words, our framework is network-free, i.e. directly applicable to any network without requiring explicit modelling and analysis. Furthermore, the DNN is composed of gated recurrent units (GRU) [20], which capture and process the temporal information in our data. The use of GRU is also necessary because our input data take the form of a time series, and the same GRU architecture can process time-series input of arbitrary length. (In contrast, a densely connected feedforward DNN would have its number of input nodes depend on the length of the time-series data.) In comparison, analytical statistical frameworks like Vardi's algorithm [7] and the Bayesian methods [3] necessitate explicit encoding of the network (referred to as the "routing matrix", related to the adjacency matrix of a graph) before implementation. Temporal information is not exploited, as each datum is treated as independent of the others.
We harness this DNN framework to study general complex networks with different topologies like lattice, random, and small-world, as well as real-world networks, to glean how network topology affects the accuracy in predicting the OD probabilities. Recently, there has been great interest in using deep learning methods to solve problems in applications modelled by complex networks, such as predicting the dismantling of complex systems [21], and contagion dynamics [22]. While these papers trained deep learning approaches to identify topological patterns in dynamical processes on networks, they did not explore how the topology of different complex networks affects OD prediction accuracy. Incidentally, Ref. [22] made use of OD information on human mobility to study the spread of COVID-19 in Spain through a complex network model. They compared their DNN approach with a maximum likelihood estimation (MLE) technique and found that the former outperforms the latter. This outcome is analogous to our case, because Vardi's expectation-maximisation algorithm with moments is in fact an MLE method. Moreover, unlike Refs. [21,22], which implemented a graph neural network and a graph attention network that exploit convolution over neighbouring nodes to significantly improve performance, we design a GRU architecture which leverages temporal information to predict OD probabilities.

A. Data measurements
Before presenting the linear regression (LR) and deep neural network (DNN) approaches to infer the origin-destination distribution, we elaborate on the data types that are easily measurable within generic real-world complex networks. In our setup, we primarily consider the dataset to be a time series of T time steps, where the number of agents is tracked at every single time step. Then, at a single time step, the measurable quantities are: x i , the number of agents leaving origin i; y j , the number of agents arriving at destination j; as well as F ab , the number of agents on the directed edge from node a to node b of the complex network.
We employ x i and y j for LR due to its formulation. For a general complex network, LR can only serve as an approximate estimator, unless the network is simple enough to allow for an exact representation. In contrast, conventional algorithms for origin-destination estimation [3,7] use the number of agents on the edges, F ab . Hence, in developing our DNN framework, we use only F ab , without including x i and y j . This mode of data measurement is used in Sections I E and I F.
Nevertheless, in order to provide a fair evaluation of our DNN framework as compared to Vardi's [7] and Tebaldi-West's [3] approaches, we consider a separate type of measurement in our comparison against these traditional approaches. In this case, we count the number of agents in x i and F ab over some fixed time interval, i.e. these numbers are aggregated instead of the actual numbers at every time step. Then, T such measurements are collected (where this T is now the number of independent aggregated measurements, instead of the length of the time series). This means of aggregated measurement is in fact implemented by Vardi and Tebaldi-West, although Vardi's formulation uses two variables: one counts the number of agents from origin to destination, and the other counts the agents on the edges, which he denotes by X and Y respectively. To avoid confusion with our notation, we have renamed our x i as V i while retaining the use of Vardi's vector of aggregated numbers of agents on all edges, Y , in our formulation. We use these aggregated variables in our DNNGRU studies as well as in our LR formulation using these variables, described in the Supplemental Material (SM). Note that this mode of data measurement is employed in Section I D.

B. A Linear Regression Approach
Let us consider a complex network of nodes where agents from an origin node can end up at any destination node in the network. In our context, an edge in the network corresponds to a carrier route that facilitates transfer of agents between the two nodes connected by it. For example, data server hubs are linked through a series of fibre-optic cables (carriers). As not all data server hubs are directly connected due to geography, data (agents) transfer between a pair of data servers may traverse other data servers along the way, as depicted in Fig. 1(a, b, c). In another example, the bus service in Fig. 1(d) comprises a loop of 12 bus stops served by buses going around. In other words, buses (carriers) must sequentially traverse one bus stop after another to deliver commuters (agents) in some fixed order along the prescribed loop.

FIG. 1: a: Hypothetical example of three origin data servers (red x) connected to three destination data servers (green *) in Asia-Pacific. b: Pictorial visualisation of data server connections of (a). c: Network representation of (a, b). d: Blue Route of the NTU campus shuttle bus service [23,24]. e: Network representation of (d), where the nodes are linked via a semi-express configuration [25,26]. Each of the six coloured arrows represents one semi-express bus picking up commuters from distinct subsets of origins. Subsequently, all buses allow alighting at all destinations D1, · · · , D6.
In a network with M O origins and M D destinations, there are M O (M D − 1) free parameters which quantify the probability distribution of agent-transfer from one node to another. This is because at each of the M O origins, there are M D destinations and these probabilities sum to 1. Whilst we do not know where each agent goes, we can write down this general multivariate linear system:

y_j = Σ_{i=1}^{M_O} ζ_{ij} x_i , for j = 1, · · · , M_D , (1)

subject to the constraints Σ_{j=1}^{M_D} ζ_{ij} = 1 for each i = 1, · · · , M_O. Here, x i is the number of agents the carrier picks up from origin i = 1, · · · , M O , whilst y j is the number of agents that the carrier delivers at destination j = 1, · · · , M D . The sought-after quantities are ζ ij , denoting the OD probabilities of agents going from i to j. As Eq. (1) does not account for agent-transfer through edges along different travelling paths between the same origin and destination, it essentially serves as an approximate model.
Assuming that ζ ij are stationary and hence independent of time, repeated measurements will yield different x i and y j , leading to an overdetermined set of equations through Eq. (1). The OD coefficients ζ ij can then be deduced by linear regression (LR), given the dataset (x i , y j ) to be fitted [27]. In other words, we analytically solve for ζ ij through the minimisation of the mean squared error. As LR gives coefficients ζ ij ∈ R, there are occasions when ζ ij < 0. We handle this by minimally shifting the entire set of ζ ij to make all ζ ij ≥ 0, and then normalising them so that the M O constraints below Eq. (1) are satisfied. This formulation constitutes a linear framework of OD estimation modelled by a complex network that relates to a particular system of interest. Depending on the system under examination, the variables x i and y j are determined from the relevant measured empirical data. We have compared our LR approach with quadratic programming, where the optimisation is performed by imposing the additional constraint that ζ ij is non-negative. While quadratic programming is observed to perform consistently better than LR at small data sizes, LR takes a significantly shorter time to reach the solution. Both approaches nonetheless converge to the same solution as the sample size increases.
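The LR procedure above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it uses a hypothetical 2-origin, 2-destination system with noiseless toy data, solves the per-destination least-squares problem via normal equations, and then applies the minimal-shift-and-normalise repair for negative coefficients.

```python
# Hedged sketch of the LR estimator: given repeated measurements (x, y)
# with stationary fan-outs zeta_ij, solve least squares per destination j,
# then repair negative coefficients by a minimal shift and renormalise
# each origin's row so it sums to 1. All names and data are illustrative.
import random

def lstsq_2(X, y):
    """Solve min ||X z - y|| for a T x 2 design matrix X via normal equations."""
    a = sum(r[0] * r[0] for r in X); b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    u = sum(r[0] * yy for r, yy in zip(X, y))
    v = sum(r[1] * yy for r, yy in zip(X, y))
    det = a * d - b * b
    return [(d * u - b * v) / det, (a * v - b * u) / det]

def fit_fanouts(xs, ys):
    """xs: list of (x1, x2) origin counts; ys: list of (y1, y2) destination counts."""
    M_D = len(ys[0])
    cols = [lstsq_2(xs, [y[j] for y in ys]) for j in range(M_D)]  # zeta[:, j]
    zeta = [[cols[j][i] for j in range(M_D)] for i in range(2)]
    for row in zeta:                      # enforce zeta >= 0, rows summing to 1
        shift = max(0.0, -min(row))
        row[:] = [z + shift for z in row]
        s = sum(row)
        row[:] = [z / s for z in row]
    return zeta

# toy ground truth: origin 1 sends 70% to D1, origin 2 sends 20% to D1
true = [[0.7, 0.3], [0.2, 0.8]]
random.seed(0)
xs = [(random.randint(5, 30), random.randint(5, 30)) for _ in range(500)]
ys = [tuple(sum(true[i][j] * x[i] for i in range(2)) for j in range(2)) for x in xs]
zeta = fit_fanouts(xs, ys)
```

With noiseless data the regression recovers the fan-outs essentially exactly; with stochastic boarding/alighting counts, the estimate improves as the number of repeated measurements grows.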
In the case of the bus system, publicly accessible information such as the number of people on buses and the duration buses spend at bus stops [28] can be used to provide x i = number of people boarding and y j = number of people alighting from the associated bus. In the context of data servers, x i and y j are deducible from the traffic load passing through the fibre-optic cables carrying packet bits, since an increase/decrease in load is due to x i /y j from/to a data server.

C. DNN with Supervised Learning
The main idea of our approach is to determine the OD probabilities ζ ij from easily accessible or directly measurable information, specifically the number of agents on the edges of the complex network. This ability to infer ζ ij is particularly important when it is impossible to measure ζ ij directly. For example, it is extremely challenging for investigators to figure out the tracks of nefarious hackers/fugitives, who would doubtlessly obfuscate their direct transfer of data packets across various data servers. Problems like commuter flows in transportation systems, as well as fugitive hunting, motivate inferring the most likely destination of agent transfer from an origin, so that we can improve service and connectivity, or better locate a target's whereabouts. We will thus measure the prediction accuracy of the algorithms developed in this paper by their ability to predict the most popular destination with respect to an origin.
In order to deal with these more general situations, we develop a DNN architecture composed of GRU (i.e. DNNGRU) to output ζ ij by supervised learning, where training may be performed on real-world datasets, which are amply available. In addition, labelled data can also be generated through simulation from systems with known characteristics, for the DNN to generalise from them to new unseen inputs. Fig. 2 gives an overview of the approach of this work, where information on the number of agents on the edges (e.g. the number of people on buses after leaving bus stops) is passed into DNNGRU to infer OD probabilities of the network.
Our DNN can take as input a more generic form of the dataset with (x i , y j ) implicitly contained, as compared to LR described in the previous subsection. It thus encompasses a more general data and model structure of the system-under-study, such that the actual mechanism of the carriers may be complex and nonlinear. Supervised learning then allows our DNN to learn directly and more generally from labelled ζ ij . Furthermore, we adopt a DNN with sigmoid activation to naturally output the correct range of ζ ij ∈ [0, 1]. An important characteristic of our designed DNN architecture is the incorporation of GRU to enable inference from time series. The GRU is a recurrent unit with a hidden memory cell that allows information from earlier data to be combined with subsequent data in the time series. This is crucial as the dataset (x i , y j ) may carry temporal information. For instance, buses that recently picked up commuters leave fewer people for subsequent buses, hence cascading effects (like bus bunching, overtaking, load-sharing [23,29]) induce deviations from Eq. (1), which ignores temporal correlation as it treats each datum independently.
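To make the role of the GRU's memory concrete, here is a minimal pure-Python GRU cell. This is not the paper's trained DNNGRU; the dimensions, random weights, and toy edge-count series are purely illustrative. It shows how the update gate z and reset gate r let the hidden state h mix past information with the current input at each time step.

```python
# Minimal GRU cell (forward pass only), illustrating how the hidden state
# carries temporal information across a time series of edge counts.
import math, random

def sigmoid(v): return 1.0 / (1.0 + math.exp(-v))

def matvec(W, v): return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(x, h, p):
    cat = x + h                                        # concatenated [x; h]
    z = [sigmoid(a) for a in matvec(p["Wz"], cat)]     # update gate
    r = [sigmoid(a) for a in matvec(p["Wr"], cat)]     # reset gate
    cat_r = x + [ri * hi for ri, hi in zip(r, h)]
    h_tilde = [math.tanh(a) for a in matvec(p["Wh"], cat_r)]  # candidate state
    # new state: convex mix of old state and candidate, controlled by z
    return [(1 - zi) * hi + zi * hti for zi, hi, hti in zip(z, h, h_tilde)]

random.seed(1)
n_in, n_hid = 3, 4   # e.g. 3 edge counts per time step, 4 hidden units
params = {k: [[random.uniform(-0.5, 0.5) for _ in range(n_in + n_hid)]
              for _ in range(n_hid)] for k in ("Wz", "Wr", "Wh")}
h = [0.0] * n_hid
series = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0], [2.0, 1.0, 0.0]]  # toy time series
for x in series:
    h = gru_step(x, h, params)
```

In the actual architecture, the final hidden state would be passed through sigmoid-activated output layers to produce the ζ ij ; because the same cell is applied step by step, input series of arbitrary length T can be processed.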

D. Comparison amongst DNNGRU, LR, EM and other traditional methods
We directly compare our DNNGRU and LR approaches with the traditional Vardi's expectation-maximisation (EM) algorithm, Willumsen's entropy maximisation, as well as the Bayesian method, on Vardi's network as a testbed [7]. For this purpose, we adapt the generation of data according to the approach of Vardi: the number of agents on the edges is measured over an extended time period [3,7]. This would be a setup where a counter tracks the overall number of agents that pass through an edge throughout the entire day, for instance. In this case, samples from various days are independent, with the number of people X k generated for each origin-destination pair being a Poisson random variable with a fixed mean λ k . Vardi's network has four nodes, and consequently 12 OD components in X (see Methods for a review of Vardi's EM algorithm and notations).
We generate datasets in the following manner: For each dataset, each λ k is an integer chosen from the interval [1, 20]. Then, an X is drawn, with Y computed via Eq. (4). Here, Y is the vector of the number of agents on the edges of the network. The ζ ij are computed by appropriately normalising λ with respect to each origin node. In other words, the ground truth of ζ ij is: ζ 11 = λ 1 /(λ 1 + λ 2 + λ 3 ), etc. Therefore, for each dataset, we have ζ ij and Y . The goal is to infer ζ ij from Y . We collect T samples of Y in each dataset for that specific ζ ij , and this provides the input for DNNGRU, as well as Vardi's algorithm. More datasets can be generated by resetting λ. For LR to be applicable as an exact model, we need to input an additional piece of information, viz. the total number of people originating from each origin node, V . The exact solution of ζ ij by LR is given in SM. This additional information is also provided to train an alternative DNNGRU, which we call "DNNGRU-V" to distinguish it from the version that does not include V as its input. Separate DNNGRUs (as well as DNNGRU-Vs) are trained to output ζ ij for each i, so a network with M O origin nodes has M O DNNGRUs. Furthermore, for each i, we let T = 1, 3, 5, 10, 20, 30, 50, 70, 100, 140, 200, 300, 500, 750, 1000, and apply transfer learning (see Methods) to separately train different DNNGRUs for different T . Training comprises 250k datasets (where ζ ij has been randomised), with another 10k used for validation during training. Additionally, 50k datasets (with randomised ζ ij ) are used for testing.
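The data-generation recipe above can be sketched as follows. This is a hedged illustration on a hypothetical 2-origin, 2-destination network with a made-up routing matrix, NOT Vardi's actual 4-node topology, which we do not reproduce here; only the structure of the recipe (integer λ from [1, 20], Poisson X, edge counts Y = A X, ground-truth ζ from per-origin normalisation of λ) follows the text.

```python
# Sketch of Vardi-style sample generation: Poisson OD counts X with fixed
# means lambda_k, aggregated onto edge counts Y via a routing matrix A.
import math, random

def poisson(lam, rng):
    """Knuth's Poisson sampler (adequate for small lambda)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(42)
lam = [rng.randint(1, 20) for _ in range(4)]      # OD order: 11, 12, 21, 22
zeta = [[lam[0], lam[1]], [lam[2], lam[3]]]       # ground truth: normalise
zeta = [[z / sum(row) for z in row] for row in zeta]  # lambda per origin

# hypothetical routing matrix: rows = edges, columns = OD pairs
A = [[1, 1, 0, 0],   # edge out of origin 1 carries OD 11 and 12
     [0, 0, 1, 1],   # edge out of origin 2 carries OD 21 and 22
     [1, 0, 1, 0],   # edge into destination 1
     [0, 1, 0, 1]]   # edge into destination 2

T = 5                # number of independent aggregated measurements
samples = []
for _ in range(T):
    X = [poisson(l, rng) for l in lam]            # draw OD counts
    Y = [sum(a * x for a, x in zip(row, X)) for row in A]  # edge counts
    samples.append(Y)
```

Each dataset consists of the T samples of Y together with the label ζ; resetting λ yields a new dataset with a new ground truth.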
After obtaining these probabilities ζ ij , we determine the most popular destination j with respect to origin i by selecting the j with the highest probability ζ ij . We also determine how accurate the prediction of ζ ij is with respect to the true value, by two different measures, viz. the difference ε = |ζ ij,predicted − ζ ij,true |, as well as the coefficient of determination r 2 (see Methods) for the straight-line fit of ζ ij,predicted = ζ ij,true . Fig. 3(a) shows the percentage error E in predicting the most popular destination, versus the number of samples T . Fig. 3(b) shows the percentage of the predictions where ε > 0.05, versus T . Fig. 3(c) shows 1 − r 2 for the straight-line fit of ζ ij,predicted = ζ ij,true , versus T .
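The evaluation measures just described can be written down compactly. The toy numbers below are illustrative; the r² here is computed for the fit ζ_predicted = ζ_true, i.e. residuals are taken against the identity line, matching the text.

```python
# Sketch of the three evaluation measures: most popular destination
# (argmax over zeta_ij), the error eps = |zeta_pred - zeta_true|,
# and the coefficient of determination r^2 for zeta_pred = zeta_true.
def most_popular(zeta_row):
    return max(range(len(zeta_row)), key=lambda j: zeta_row[j])

def r_squared(pred, true):
    mean = sum(true) / len(true)
    ss_res = sum((p - t) ** 2 for p, t in zip(pred, true))  # vs identity line
    ss_tot = sum((t - mean) ** 2 for t in true)
    return 1.0 - ss_res / ss_tot

true = [0.50, 0.30, 0.20]    # toy ground-truth row of zeta for one origin
pred = [0.48, 0.33, 0.19]    # toy prediction
eps = [abs(p - t) for p, t in zip(pred, true)]
flagged = sum(e > 0.05 for e in eps)   # count of predictions with eps > 0.05
```

A prediction counts as correct for the first measure whenever the argmax matches, even if the individual ζ ij values are slightly off; the ε and r² measures then quantify how far off they are.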
Evidently, DNNGRU outperforms EM with the usual data of the number of agents on all the edges. We also observe LR to perform better than DNNGRU and EM, which results from the exploitation of the additional data V . Note that V is easily obtainable, for example, by placing a counter that tracks every agent leaving that node, without actually knowing where the agents are going. The use of V has also resulted in improvement in the outcome of DNNGRU-V, which is observed to have superior performance relative to LR. DNNGRU (and DNNGRU-V) is advantageous here by benefiting from the many weights and biases tuned through supervised learning with a deep neural network. Basically, the enhanced performances of DNNGRU and DNNGRU-V over Vardi's algorithm and LR, respectively, arise from the gains accrued by learning from data. Nonetheless, the performance of DNNGRU-V would converge to that of LR in the asymptotic limit of large T , since LR gives the exact solution (see Fig. 1(a) of SM). On the other hand, LR shows better performance compared to EM due to its lower modelling uncertainties from fewer underlying assumptions. As for entropy maximisation and Bayesian inference, the results are below par and hence not displayed. These two methods do not address the underdetermination problem and are consequently still burdened by the high uncertainty of their results. See SM for more details.
Incidentally, results corresponding to Fig. 3(c) where the predictions are the actual OD intensities instead of ζ ij are shown in Fig. 1(b) of SM. To predict the actual OD intensities, we just multiply the predicted probabilities ζ ij by the corresponding average total number of people at the origins. This is fully in accordance with how Vardi intended to apply his expectation-maximisation algorithm. In terms of the actual OD intensities, again LR, DNNGRU, and DNNGRU-V all outperform EM.
A performance analysis and comparison between DNNGRU (without V ), LR, and EM has also been performed for a loop network (Fig. 4(a)), with the same outcome observed. Details are given in Methods, with corresponding results displayed in Figs. 3(d, e, f). The data generation here is not based on measuring the agents over a specified time interval, as prescribed by Vardi's network [7]. Instead, it is based on the realistic flow of agents from one edge to the next, which we elaborate further in the next subsection on general complex networks. The point here is that DNNGRU and LR are superior to Vardi's EM algorithm. All three methods receive the same information on the number of agents on the edges at every time step, over T time steps.

FIG. 3: a, b, c: Comparison of DNNGRU against Vardi's EM algorithm [7], as well as LR and DNNGRU-V with additional data on the total number of agents from the origin nodes. Shown in parentheses of the legend in (c) are the exponents of a power law fit. d, e, f: Corresponding comparison amongst DNNGRU, LR, and Vardi's EM algorithm [7] for the loop in Fig. 4(a). g, h, i: Corresponding comparison amongst DNNGRU, LR and an analytical averaging for a lattice with MO = MD = 3 in Fig. 4(g). Here, LR in Eq. (1) is only an approximation as it does not track the exact number of yj. In (h), note that for T > 500, DNNGRU and the analytic treatment have 0% error in predicting ζij to be within 0.05 of the true value; this would be −∞ on the log-scale, which is not shown.

E. General complex networks
Consider a network with M O origin nodes and M D destination nodes. At each time step, origin i can generate any number from 1 to P agents, each assigned to go to destination j with probability ζ ij . The agents then propagate along subsequent edges of their paths as the time steps progress. Agents take the shortest path via edges of the network. If there are many shortest paths, one is randomly chosen (out of a maximum of n pre-defined shortest paths). We consider two versions: when an agent traverses an edge, it either spends one time step on that edge before moving to another edge or arriving at its destination (without lag), or it spends l time steps on the edge before proceeding (with lag). The number of agents on each edge is recorded at every time step. The system is allowed to evolve, and the last T time steps are taken to train a DNNGRU.
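The no-lag version of this generator (l = 1) can be sketched as below. The tiny directed graph, the fan-outs, and all parameters are illustrative assumptions, not any of the networks in Fig. 4; agents follow a BFS shortest path, and the per-edge counts F recorded each time step form the feature time series.

```python
# Hedged sketch of the agent-propagation data generator (no-lag version):
# each time step, every origin spawns up to P agents with destinations
# drawn from zeta; each agent follows a shortest path, advancing one edge
# per time step, and the per-edge occupancy F is recorded.
import random
from collections import deque

def shortest_path(adj, src, dst):
    """BFS shortest path in a directed graph given as {node: [succ, ...]}."""
    prev, q = {src: None}, deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u); u = prev[u]
            return path[::-1]
        for v in adj.get(u, []):
            if v not in prev:
                prev[v] = u; q.append(v)

adj = {"O1": ["A"], "O2": ["A"], "A": ["D1", "D2"]}   # toy network
origins, dests = ["O1", "O2"], ["D1", "D2"]
zeta = {"O1": [0.7, 0.3], "O2": [0.2, 0.8]}           # toy fan-outs
rng = random.Random(7)
P, T = 10, 50
agents, series = [], []              # each agent = list of remaining edges
for _ in range(T):
    for o in origins:                # spawn 1..P agents at each origin
        for _ in range(rng.randint(1, P)):
            d = dests[0] if rng.random() < zeta[o][0] else dests[1]
            p = shortest_path(adj, o, d)
            agents.append([(p[k], p[k + 1]) for k in range(len(p) - 1)])
    F = {}
    for a in agents:                 # each agent occupies its current edge
        F[a[0]] = F.get(a[0], 0) + 1
    series.append(F)
    agents = [a[1:] for a in agents if len(a) > 1]  # advance one edge per step
```

The with-lag version would simply keep each agent on its current edge for l steps before advancing; random choice among multiple shortest paths could be added by enumerating up to n paths and sampling one.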
We consider various such complex networks in our study (Fig. 4). As presented in the previous subsection where we compared with Vardi's EM algorithm, the loop in Fig. 4(a) is also a real-world network of interest, and is studied in greater detail in Section I F. In our study, we have set P = 10, n = 4, and l ∈ {1, 10}. Also, the number of agents originating from each origin is changed every 20 time steps to introduce stochasticity and variation, though ζ ij is kept constant throughout each dataset. The adjacency matrices for the networks in Figs. 4(a-g) are provided in the SM. We refer to the number of active directed edges (ADE) as those links that are actually being used, i.e. forming part of the shortest paths from origins to destinations. The numbers of ADE for the networks in Figs. 4(a-g) are, respectively: 11, 38, 32, 24, 24, 40, 18. More ADE implies more information being tracked. The number of agents on every ADE at every time step forms the feature dataset to be fed into the input layer of a DNNGRU, lasting T time steps (c.f. Fig. 2). Unlike the previous subsection where we tested on Vardi's network, information on V is not supplied to DNNGRU, i.e. we no longer consider DNNGRU-V in the rest of this paper.

A small lattice
In order to gain a deeper understanding of the performance of LR and DNNGRU, we employ a small lattice (Fig. 4(g)) that is amenable to analytical treatment. The analytical treatment essentially tracks the paths of agents through the various edges, to arrive at a set of equations which exactly solves for all the unknown variables (elaborated in SM). Specifically, as the agents propagate from an origin node of the small lattice, those destined for the same end node may traverse different paths, and the travelling pattern of the agents can be modelled analytically. Nonetheless, there is still stochasticity in the analytical treatment due to the stochastic nature of agents spawning at the origins. And with the evaluation of ζ ij based on the ratio of two integer numbers (of agents), there can still be uncertainties in the prediction of ζ ij . Note that the analytical treatment is achievable here because the small lattice has few nodes with a relatively high number of edges. On the other hand, LR as given by Eq. (1) is only an approximation, as it does not track the correct number of y j . Thus, we observe the better performance of the analytical treatment over LR in Figs. 3(g, h, i). DNNGRU matches the analytical results, but appears slightly inferior in the asymptotic limit of large T , perhaps because the optimisation of DNNGRU does not find the absolute best minimum of the loss.

Topological effects of the complex networks
Let us concern ourselves with the three networks: lattice, random, and small-world (Figs. 4(b-d)), which have the same number of origin and destination nodes, as well as the same number of edges. Due to their different topologies, these networks have distinct ADE and thus diverse shortest paths that the agents could travel from origin node to destination node. As DNNGRU learns a model that takes these different shortest paths of the agents into account while LR does not, DNNGRU outperforms LR in prediction accuracy for the networks in Figs. 4(b-d). It turns out that the small-world topology performs the worst, given the same number of nodes and edges, because many shortest paths prefer taking the common "highway edges", the very property that makes it small-world. The ramification of this is its smaller ADE (at 24), which carries less information than the lattice or random networks with greater ADE. This observation is, however, inconsistent with the fact that the random network, which has a smaller ADE of 32, outperforms the lattice network with a higher ADE of 38. The deeper reason is revealed by looking into the performance variation across different origin nodes within the network.
Recall that each origin node i is trained with its own DNNGRU to exclusively predict ζ ij for that i. Fig. 7 reports the results for 1 − r 2 versus T with respect to each particular origin node. Corresponding results for the percentage errors of predicting the most popular destination, as well as those for ε > 0.05, are given in SM.

FIG. 8: Here, we highlight the origin nodes that are relatively worst at predicting the most popular destination, compared to other origin nodes. a: (lattice) O3 and O4 are the least accurate. Shown here is with respect to O3, which is the furthest from all destination nodes. So, O3 has many edges which overlap with the edges used by others. b: (random) Shown here are the edges used by O2 to get to all destination nodes, with O3 being a parasitic origin node. This is because whilst O3 uses O3 → D2 and O3 → D2 → O4 → D1, it otherwise overwhelmingly goes to O2 and traverses all the edges used by O2 to get to the destination nodes. c: (small-world) Shown here are the edges used by O1 to get to all destination nodes, with O2 and O3 both being parasitic nodes as they overwhelmingly go to O1 and traverse all the edges used by O1 to get to the destination nodes.
Since the lattice, random, and small-world networks have origin nodes with different topological properties, the agents could have alternative pre-defined shortest paths to take. Consequently, different network topologies do indeed lead to some origin nodes performing better than others. We can determine the relatively worst-performing origin nodes in these networks by examining the graphs in Fig. 7. For the lattice, origin nodes O 3 and O 4 (which are equivalent, due to the symmetry of the lattice) have a relatively higher error compared to the other origin nodes. For the random network, the worst-performing origin nodes are O 2 and O 3 , with O 6 also relatively poor, whilst those for the small-world network are O 1 , O 2 and O 3 . Accuracy in inferring ζ ij relies on the ability to disentangle the individual i, j components from the combined information when they add up on the shared edges. The lattice in Fig. 8(a) shows how O 3 (and O 4 , by symmetry) tends to be inferior compared to the rest, because it is the furthest away from all destination nodes. This means that it has to traverse relatively more edges, and consequently overlaps with more edges used by other origin nodes.
Another way an origin node can become inferior is due to a parasitic origin node that taps on it to get to the destinations. This is obvious in the random and small-world networks, as depicted in Figs. 8(b, c). For the random network, O 3 has essentially all its paths relying on those used by O 2 . This makes both of them suffer slightly worse performance, since it is more difficult to ascertain which of them a given ζ ij should be attributed to. Other origin nodes have more diverse paths, allowing for more information to infer their own ζ ij . Similarly, for the small-world network, O 2 and O 3 are parasitic origin nodes with the majority of their paths going via those of O 1 , resulting in all three of them being slightly inferior to the other origin nodes.
Let E ij be the total number of edges on which origin nodes O i and O j overlap when getting to all destinations. We summarise the values of E ij for these three networks in Table I. Generally, origin nodes with greater overlap, i.e. larger E ij , tend to be less accurate. This can be made more precise by encapsulating the overlaps into an index χ i for each origin node O i , which quantifies its relative degree of overlap with the other origins:

\chi_i = \frac{1}{M_O - 1} \frac{1}{E_i} \sum_{j \neq i} E_{ij}^2.   (2)

In Eq. (2), E i is the number of edges involved in getting from O i to all destinations. The division by M O − 1 makes χ i an average over all the overlapping origin nodes (minus one to exclude itself). The division by E i normalises, since nodes with more edges would tend to proportionally overlap more with other origins' edges. The rationale for taking the sum of squares of E ij (as opposed to just the sum of E ij ) is that an edge overlap involves two origin nodes O i and O j , so E ij should arguably appear as two factors. The squaring also penalises heavy overlaps more strongly than many small overlaps. This is especially critical for an origin node with a parasitic companion: together they share a large number of edges E ij , which raises both their index values χ i and χ j by ∼ E ij 2 . As revealed in Table I, larger values of χ i correspond to origin node O i being less accurate (larger error) relative to other origin nodes within that network, as deduced from Fig. 7. Incidentally, this index also corresponds to the random network performing better than the lattice, with the small-world network as the worst. The average values χ̄ of χ i for these networks (random, lattice, small-world) are 1.46, 2.75, 3.94, respectively. This was, in fact, already deduced from Fig. 6 earlier. Intriguingly, the average index χ̄ explains why the random network with 32 ADE manages to outperform the lattice with 38 ADE: the former's smaller χ̄ indicates less overlapping of edges between origin nodes. Thus, by accounting for the overlap of edges as a measure of how hard it is to disentangle which agents travel from which origins to which destinations, we obtain an indicator of which origin nodes' OD probabilities are easier to predict. We are also able to verify this empirically using the model-free DNNGRU on each origin node (as displayed in Fig. 7).

[Table I: Crosstables of edge overlaps for the networks in Figs. 4(b-d). The numbers in parentheses in the leftmost column of each crosstable indicate the total number of edges E i involved for that node to get to all destinations. The crosstables show the number of edges E ij that origin node O i overlaps with those of origin node O j , when getting to any destination. Rows coloured in red show the poorer or poorest performing nodes for that network, as deduced from Fig. 7; these correspond to large values of χ i , signifying greater overlaps.]

No overlap, partial overlap, superhighway networks
In this subsection, we provide a direct illustration of the effects of topology and edge sharing on the accuracy of predicting OD information, through a study of the idealised networks shown in Fig. 9. Here, three networks with six origins and six destinations are considered. The first has no overlap, so inferring ζ ij from the edge counts carries no uncertainty over which origin or destination an agent belongs to, though stochasticity remains since agents choose their destinations probabilistically. The second network has partial overlap, since two origin nodes share a common edge. The inference of ζ ij can then be achieved by simple linear regression involving the pair of nodes, based on Eq. (1). Finally, the third network has all six origin nodes sharing a common superhighway edge. As in the second network, linear regression is directly applicable, here involving all six origin nodes.
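As a concrete illustration of how Eq. (1) is fitted, the following sketch simulates agents choosing destinations probabilistically and recovers ζ ij by least squares. All sizes, rates, and variable names here are hypothetical choices for illustration, not the paper's actual simulation parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
M_O, M_D, n_samples = 6, 6, 2000

# Hypothetical OD probability matrix: each row sums to 1.
zeta = rng.dirichlet(np.ones(M_D), size=M_O)

X = np.empty((n_samples, M_O))
Y = np.empty((n_samples, M_D))
for t in range(n_samples):
    x = rng.poisson(20.0, size=M_O)        # agents leaving each origin
    y = np.zeros(M_D)
    for i in range(M_O):
        # Each origin's agents pick destinations multinomially with probs zeta[i].
        y += rng.multinomial(x[i], zeta[i])
    X[t], Y[t] = x, y

# Least-squares fit of Eq. (1): Y ~ X @ Z, with row i of Z estimating zeta[i].
Z, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.abs(Z - zeta).max())              # small estimation error
```

Because the total alighting count equals the total boarding count in every sample, the fitted rows of Z automatically sum to 1, mirroring the normalisation constraint on ζ ij .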
The results are shown in Fig. 10. It is evident that the no overlap network is the easiest for inferring ζ ij , with the superhighway network being the hardest. Because LR provides an exact formulation of all three networks, it consistently performs best given sufficient data. In all three networks, we observe that accuracy improves in a power-law manner with more data, with DNNGRU matching the results of LR. We can calculate χ̄ for each of these networks. The index value for the no overlap network is zero, since E ij = 0 for all pairs of origins. For the partial overlap network, the sum of E ij 2 in Eq. (2) is only 7 2 , since each origin overlaps with exactly one other origin. Finally, the sum of E ij 2 for the superhighway network is 5 × 7 2 , since each origin overlaps with five other origins. Consequently, χ̄ = 0, 1.225, 6.125, respectively, for the no overlap, partial overlap and superhighway networks. This again shows χ̄ as a measure of the degree of shared edges, with higher values signifying lower accuracy. Nevertheless, note from Fig. 10 that increasing the data size improves prediction accuracy in a power-law manner in all cases, even for the superhighway network where all information passes through a common edge.
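The χ̄ values quoted above can be checked with a short sketch. We assume the form χ i = Σ j≠i E ij 2 / (E i (M O − 1)) described in the text, and assume each origin uses E i = 8 edges in these idealised networks; this E i is an inferred value, chosen because it reproduces the quoted χ̄ of 1.225 and 6.125.

```python
import numpy as np

def chi_index(E_overlap, E_edges):
    """Overlap index chi_i for each origin node.

    E_overlap : (M_O, M_O) array, E_overlap[i, j] = number of edges that
                origin i shares with origin j (diagonal ignored).
    E_edges   : (M_O,) array, E_edges[i] = total edges used by origin i.
    """
    E = np.asarray(E_overlap, dtype=float).copy()
    np.fill_diagonal(E, 0.0)           # exclude self-overlap
    M_O = E.shape[0]
    return (E ** 2).sum(axis=1) / (np.asarray(E_edges, dtype=float) * (M_O - 1))

# Idealised 6-origin networks from the text, assuming E_i = 8 edges per origin.
M_O, E_i = 6, 8

# Partial overlap: origins paired up, each pair sharing 7 edges.
partial = np.zeros((M_O, M_O))
for a, b in [(0, 1), (2, 3), (4, 5)]:
    partial[a, b] = partial[b, a] = 7

# Superhighway: every origin shares 7 edges with every other origin.
superhighway = np.full((M_O, M_O), 7.0)

chi_partial = chi_index(partial, [E_i] * M_O)
chi_super = chi_index(superhighway, [E_i] * M_O)
print(chi_partial.mean(), chi_super.mean())  # 1.225 and 6.125
```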

Real-world complex networks
Let us now consider the topology of three real-world complex networks: the directed loop, VinaREN, and MYREN. The directed loop corresponds to a bus transport network providing a loop service with M O = M D = 6 origin and destination bus stops. Compared to the networks of the last section, which have the same number of origin and destination nodes, the directed loop with only 11 ADE performs the worst. This results from a larger average overlap index χ̄ = 6.51, due to a greater number of overlapped edges and parasitic origin nodes, implying poorer accuracy since it is harder to disentangle the various ODs. Nonetheless, the directed loop is the only network amongst those in Figs. 4(a-d) where all origin nodes display essentially equivalent performance. The indices χ i for these origins are 6.00, 6.60, 6.91, 6.93, 6.63, 6.00, respectively, which are highly similar and consistent with their comparable accuracies. The directed loop offers no alternative shortest path, so every agent from origin i to destination j takes the same path. The application of the directed loop to a bus loop service will be detailed in the next section.
For the internet networks VinaREN and MYREN, there are more destinations M D to deal with, and they generally report larger errors in predicting the most popular destination. Their ADE counts are comparable to those of the smaller networks in Figs. 4(b-d), so they do not possess proportionately sufficient ADE for their larger number of nodes. This is a consequence of many real-world internet networks being scale-free and small-world [31,32]. Whilst these two networks are too small to be considered anywhere near scale-free, this is nonetheless plausibly the situation with scale-free networks, which are ultra-small [33,34].

Bus loop service

We consider a loop of M O origin bus stops and M D destination bus stops served by N buses, in the form of Fig. 4(a). Without loss of generality, origins are placed before destinations, all staggered. Regular buses go around the loop serving all bus stops sequentially. To realise other complex network topologies in a bus system, we additionally implement a semi-express configuration where buses only board commuters from a subset of origins whilst always allowing alighting [25]. This configuration turns out to be chaotic, but outperforms regular or fully express buses in minimising commuters' waiting time, with optimal performance achieved at the edge of chaos [26]. Let N = M O = M D = 6. This system represents a morning commute scenario where students living in the NTU campus residences commute to faculty buildings (Fig. 1(d)). (The map shows 7 residences and 5 faculty buildings; we approximate this with M O = M D = 6.) The complex network for semi-express buses (carriers, i.e. edges) is depicted in Fig. 1(e). Each semi-express bus boards people from three origins, then allows alighting at every destination. Different semi-express buses do not board from the same origin subset.

[Fig. 11 caption: (a) Regression: predicting ζ 1j by DNNGRU using (α, β), α, β; and by LR using α. (b) Classification: predicting the most, second most, · · · , least likely destinations from origin i = 1 by three different DNN architectures, and LR using α.]
Unlike edges in general complex networks, each bus in a bus network maintains its identity as a carrier between nodes. In other words, passengers on different buses do not mix, even if the buses share the same road. In general complex networks, on the other hand, an edge connecting a pair of nodes combines all agents that traverse it. This makes bus systems relatively easier to treat analytically, and so we can compare the performances of DNNGRU and LR.
We can simulate each of these two bus systems given some ζ ij to generate: (α) the number of commuters on the buses after leaving a bus stop; and (β) the duration that the buses spend at bus stops. Our parameters are based on real data measured from NTU campus shuttle buses [23,28,35]. We record 10 laps of service for each bus. This is reasonable, as a bus takes ∼ 20 minutes per loop, corresponding to 3.5 hours of service for the morning commute from 8 am to 11:30 am. All buses' time series are concatenated into one single feature, for each of α and β. Empirically, this longer time series turns out to significantly improve performance compared to treating each bus as a separate feature. As before, DNN training comprises 250k datasets, with another 10k for validation during training. An additional 50k datasets are used for testing; the latter are also used by LR to predict ζ ij . The various DNN architectural details are given in Methods.
With respect to i = 1 (other origins i ≠ 1 are trained separately), DNNGRU predicts ζ 1j using (α, β), α, or β. LR predicts ζ ij for all i using α, from which (x i , y j ) are directly deducible, and we take ζ 1j for comparison. Fig. 11(a) shows the performance of DNNGRU clearly outperforming LR in predicting ζ 1j . The green, blue, orange, and red bars respectively represent ε ≤ 0.05, 0.05 < ε ≤ 0.10, 0.10 < ε ≤ 0.15, and ε > 0.15, where ε is the difference between the predicted ζ 1j and the true ζ 1j . Generally, using α, β, or both for DNNGRU is equally good. The semi-express network topology gives higher prediction accuracy than that of regular buses, since it provides more diverse data to learn from. The former topology is also a more efficient bus system [25], and conceivably so in general complex networks, as it provides a greater variety of pathways than one single loop (as discussed in Subsection I E).
From the ζ 1j obtained by DNNGRU and by LR using α, we order the destination ranking for origin i = 1 according to their probabilities, and display the prediction accuracies in Fig. 11(b). These are compared with directly classifying destination ranking by GRU or dense layers using α. Direct classification requires training individual DNNs for each rank, i.e. one DNN learns to predict the most likely destination, another the second most likely one, etc. Despite having dedicated DNNGRUs to classify each rank, they turn out to always be inferior to a single DNNGRU that predicts ζ 1j and then obtains the ranking. Intriguingly, they are only marginally superior (and not always) to LR. Consequently, if there are insufficient training data for DNNGRU, LR serves as a quick estimate with respectable accuracy compared to DNNGRU. Dense DNNs are consistently the poorest, since their simple architecture does not infer temporal information in the time series; they are not even comparable to LR. DNNGRU regressing ζ 1j is always the best. Architectural details on using DNNGRU for direct classification, and dense DNNs for classification, are given in Methods.
Classifying destination ranking produces a U-shape, with predicting the least likely destination being the most accurate. This is generally true for all the networks studied in Fig. 4 as well, when predicting rankings other than the most popular destination. Errors in predicting ζ 1j may scramble the ordering: ranks in between (2nd most likely, · · · , 2nd least likely) can be perturbed both from above and below, whilst the most and least likely ones are only affected from below or above, respectively, with the latter bounded by zero, giving it slightly better accuracy. Incidentally, DNN direct classification generally tends to overfit with excessive training, with test accuracy systematically less than training accuracy. Conversely, DNN regression for ζ 1j does not seem to suffer from overfitting, with test accuracy remaining comparable to training accuracy.
These results are useful for practical applications, as they inform us which bus stops should be prioritised and which may be skipped if needed. Train loops are another common loop service. For instance, major cities around the world have light rail systems connected to rapid transit systems. Unlike buses, trains have dedicated tracks free from traffic. They are therefore cleanly scheduled to stop for prescribed durations and do not experience bunching. This implies that β is not applicable, as the duration a train spends at a station is not proportional to demand. Nevertheless, α is measurable for inferring the OD distribution ζ ij of such a train loop. Moreover, since DNNs with supervised learning are network-free, the approach extends to routes with multiple loops and linear/branching topologies.

II. DISCUSSION
Knowledge of ζ ij is akin to possessing knowledge of the dynamical law of the system. With respect to this dynamical law, we can figure out optimal configurations of buses for the delivery of commuters from their origins to desired destinations. As a concrete implementation, the latter has been achieved via multi-agent reinforcement learning (MARL) [25]. However, only arbitrary ζ ij were tested there, such as the uniform distribution (a fixed proportion of commuters alighting at every destination) or the antipodal distribution (commuters alight at the destination opposite to where they boarded on the loop). More complicated ζ ij can certainly be prescribed for the MARL framework in Refs. [25,36], but the most useful one would be the actual ζ ij corresponding to the bus loop service being studied, like our NTU campus loop shuttle bus service [23,24,35]. Thus, the algorithm presented here is complementary to the MARL framework in Ref. [25]. We intend to further apply our algorithm to city-wide bus networks from publicly accessible data [28] with MARL optimisation, to be reported elsewhere. Other complex systems (c.f. Fig. 1) can be similarly modelled towards improving the efficiency of agent transfer or impeding undesirable transactions.
In studying general complex networks, we revealed how different network topologies can lead to (dis)advantageous performance, quantified in terms of χ i and the presence of alternative shortest paths. We also studied how different origin nodes' properties may make them perform slightly better than others, in the examples of the lattice, random, and small-world networks, as well as real-world networks like the directed loop, VinaREN and MYREN. Notably, we demonstrated empirically that a recurrent neural network architecture like GRU allows longer time-series data to yield generally more accurate predictions, with the time-series data length and prediction error appearing to scale as a power law.
Whilst we have illustrated a concrete application in a university bus loop service, the framework presented here can be used in various other areas like Internet tomography [1,3,7,10,37], city-wide bus/traffic networks [3-5, 12, 14, 15, 17, 32, 38-50], as well as in mapping global epidemic/pandemic propagation and contact tracing [16,18,22,51-59]. As we have shown that DNNGRU and our formulation of linear regression consistently outperform existing methods using expectation-maximisation with moments [7], Bayesian inference [3], and entropy maximisation [10], we expect further impactful advancements in OD inference based on this work. With DNNGRU being network-free, requiring only supervised training with data and no analytical modelling, this approach is scalable. We envision significant improvements in OD mapping in these other fields, bringing major enhancements whilst respecting privacy.

A. DNNGRU hyperparameters
We employ a deep neural network (DNN) with two hidden layers consisting of gated recurrent units (GRU). The input layer comprises the measurement of the number of agents on each active directed edge (ADE) of a complex network at some moment in time. If there are L ADE, then there are L units in the input layer. As the next two (hidden) layers are GRUs stacked on one another, the input data to the input layer can contain T time steps. In other words, each of the L units in the input layer comprises a time series of length T. Finally, the output layer comprises M D units (recall that M D is the number of destination nodes), each giving the probability of agents from some specific origin node i ending up at destination j. Different DNNs are trained individually for different origin nodes i. (See Fig. 2.) For a given setup, the input and output layers have a fixed number of units, so the number of GRU hidden layers and the number of units in each GRU hidden layer are user-defined. We fix each GRU hidden layer to have 128 units. We observe that increasing the width of the hidden layers generally improves performance. However, the number of trainable weights and biases in the DNNGRU rapidly blows up with increasing width. The choice of 128 units per hidden GRU layer balances this, with ∼ 150k trainable weights and biases, which can be well-trained by 250k training datasets. On the other hand, whilst two hidden layers definitely outperform one hidden layer, three or more hidden layers do not show improved performance over two. Thus, two hidden layers seem optimal.
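The quoted ∼ 150k figure can be sanity-checked by counting parameters layer by layer. The sketch below assumes Keras's default GRU implementation (reset_after=True, which carries two bias vectors per gate) and, as an example input/output configuration, the directed-loop sizes L = 11 and M D = 6.

```python
def gru_params(input_dim, units):
    # Keras GRU with default reset_after=True: three gates, each with an
    # input kernel, a recurrent kernel, and two bias vectors.
    return 3 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, units):
    return input_dim * units + units        # weights + biases

L, M_D, width = 11, 6, 128                  # e.g. the directed-loop setup
total = (gru_params(L, width)               # first hidden GRU layer
         + gru_params(width, width)         # second hidden GRU layer
         + dense_params(width, M_D))        # sigmoid output layer
print(total)  # 153990, i.e. ~150k trainable weights and biases
```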
DNNGRU training is carried out on TensorFlow 2.3 using Keras [60]. The default activation function is used for the GRU layers, whilst the output layer uses sigmoid. Although the sum of all output values is 1 (c.f. the summation equation below Eq. (1)), softmax does not seem desirable as it sometimes leads to stagnant training. The loss function is binary cross entropy, optimised with standard ADAM. An L 2 recurrent regularisation is applied on the GRU hidden layers; this prevents the blow-up of weights which occasionally occurs during training when no regularisation is used. A batch size of 128 is used, which is generally better than 64 or 32. However, larger batch sizes tax the GPU VRAM, leading to sporadic GPU breakdowns. Training is carried out over 200 epochs. Generally, near-optimal performance is achieved well before 200 epochs and continues to improve incrementally. No significant overfitting is observed, so no early stopping is implemented. In fact, for many of the DNNs, training accuracy is essentially similar to validation accuracy; sometimes the latter is slightly lower, but it typically continues to improve over the training epochs. Incidentally, input quantities which are small (≲ 100) are left as they are, whilst larger input values (≳ 100) are better rescaled by an appropriate division, so that the DNN performs optimally.

B. Transfer learning
With GRU layers, the input layer can take any time series length T; as long as the number of ADE L remains the same, the DNNGRU architecture remains the same. A longer T requires longer training time, since the time series data are processed sequentially by the DNNGRU and cannot be parallelised. Because the architecture is unmodified regardless of T, transfer learning can be implemented when supplying inputs of different temporal lengths T to obtain each plot point in Figs. 3, 5, 6, 7. With respect to each origin node i, a DNNGRU is first trained completely from scratch with T = 1000. After this has completed, the trained weights are used as initial weights for the next shorter time series with T = 750, which is trained for only 20 epochs instead of the full 200. This generally leads to optimised performance similar to that from complete training with random initial weights (we carried out some tests and this is generally true), but takes only 10% of the training time. Then, the trained weights for T = 750 are used as initial weights to train a DNNGRU for T = 500, and so on until T = 1.

C. The coefficient of determination r 2
The coefficient of determination r 2 is given by

r^2 = 1 - \frac{\sum_{i,j} (\zeta_{ij,\mathrm{true}} - \zeta_{ij,\mathrm{predicted}})^2}{\sum_{i,j} (\zeta_{ij,\mathrm{true}} - \bar{\zeta}_{ij,\mathrm{predicted}})^2},

where ζ̄ ij,predicted is the mean of all ζ ij,predicted . This imposes an exact fit of ζ ij,predicted = ζ ij,true with zero intercept, as opposed to the correlation coefficient of a linear regression fit that allows an unspecified intercept to be determined from the fitting. Note that the coefficient of determination r 2 can take negative values, which implies that the predictions are worse than the baseline model that always predicts the mean ζ̄ ij,predicted (i.e. ζ ij,predicted = ζ̄ ij,predicted for all i, j, giving r 2 = 0). Hence, r 2 < 0 is not shown in the plots in Figs. 3, 5, 6, 7, as such predictions are so poor as to be essentially nonsensical. In contrast, r 2 = 1 implies perfect predictions (since every ζ ij,predicted = ζ ij,true ).
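A minimal sketch of this definition (the function name is ours), showing the three regimes: perfect prediction, the constant-mean baseline, and a worse-than-baseline prediction giving r 2 < 0.

```python
import numpy as np

def coeff_determination(zeta_true, zeta_pred):
    """r^2 against the exact fit zeta_pred = zeta_true (zero intercept),
    with the baseline being the mean of the predicted values."""
    zeta_true = np.asarray(zeta_true, dtype=float)
    zeta_pred = np.asarray(zeta_pred, dtype=float)
    ss_res = np.sum((zeta_true - zeta_pred) ** 2)
    ss_tot = np.sum((zeta_true - zeta_pred.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

true = np.array([0.1, 0.2, 0.3, 0.4])
print(coeff_determination(true, true))              # 1.0: perfect prediction
print(coeff_determination(true, np.full(4, 0.25)))  # 0.0: constant baseline
print(coeff_determination(true, true[::-1]))        # negative: worse than baseline
```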

D. Vardi's expectation-maximisation algorithm
We present an analytical comparison between linear regression (LR) and Vardi's expectation-maximisation (EM), to elucidate how a switch in paradigm, from trying to infer the actual whole numbers of the OD matrix to instead inferring the probabilities, leads to significant improvements in the predictions. It is instructive to consider a directed loop for this purpose, as shown in Fig. 4(a), which is also the situation when a bus picks up people from M O origin bus stops and then delivers them to M D destination bus stops (c.f. Fig. 1(d)). This bus system is the subject of Section I F, and is based on the NTU shuttle bus service [23].
The bus picks up x i commuters from each of the origin bus stops (i = 1, · · · , M O ), and then lets y j commuters alight at each of the destination bus stops (j = 1, · · · , M D ) according to ζ ij , which leads to Eq. (1). In other words, everybody who alights at destination bus stop j, for each j = 1, · · · , M D , comprises everybody who boarded from each of the origin bus stops wanting to go to bus stop j. Since we can repeatedly measure all the x i and y j as the bus loops around over time, Eq. (1) is naturally in a form where LR is directly applicable. The OD probabilities ζ ij are the coefficients of a multivariate system of linear equations, and the best set of values of ζ ij is the one that minimises the mean squared error between the predicted y j (given the x i according to Eq. (1)) and the measured y j .
On the other hand, Vardi [7] approaches the problem differently. Instead of evaluating the OD numbers of commuters from the OD probabilities ζ ij , Vardi determines the actual whole numbers of OD commuters by modelling them directly as Poisson random variables with means λ k , where k = 1, · · · , M O M D . With the correct Poisson parameter λ k representing the mean (and variance) of the OD pair from some O i to some D j , the actual number of people who want to go from O i to D j is the random variable X k ∼ Poisson(λ k ).
It is evident from Eq. (4) that there are 36 unknown components of λ = (λ 1 , · · · , λ 36 )^T, with X ∼ Poisson(λ). However, there are only 11 independent equations, corresponding to the rows of Eq. (4). This is typical of OD inference problems, where the number of ADE providing measurements is less than the number of OD pairs. Vardi's approach is to find the maximum likelihood λ from observing Y. The key idea for overcoming underdetermination is to use moments (together with the normal approximation, given a large sample of measured Y's) to generate more, and sufficiently many, equations (details in Ref. [7]), leading to an iterative algorithm that converges to the optimal λ. Let Ȳ be the mean of all the Y's sampled by measuring the number of commuters at each ADE, and S the corresponding covariance matrix arranged as a column vector.
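The underdetermination can be made concrete by building the routing matrix of the directed loop and checking its rank. The construction below is our own illustration of the M O = M D = 6 loop (stops ordered O 1 , · · · , O 6 , D 1 , · · · , D 6 with 11 consecutive edges), not a matrix reproduced from the paper.

```python
import numpy as np

# Routing matrix for a directed loop: origin stops O1..O6 followed by
# destination stops D1..D6; edge e (1..11) links stop e to stop e+1.
# The OD pair (i, j) traverses edges e with i <= e <= 5 + j.
M_O = M_D = 6
n_edges = M_O + M_D - 1                       # 11 active directed edges
A = np.zeros((n_edges, M_O * M_D), dtype=int)
for i in range(1, M_O + 1):
    for j in range(1, M_D + 1):
        k = (i - 1) * M_D + (j - 1)           # flattened OD index
        for e in range(i, 5 + j + 1):
            A[e - 1, k] = 1

print(A.shape, np.linalg.matrix_rank(A))      # (11, 36), rank 11 < 36 unknowns
```

With only 11 independent rows against 36 unknown OD components, no amount of averaging of Y alone can pin down λ, which is why additional equations from moments are needed.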
On top of that, let B be a matrix obtained from A, where each row of B is the element-wise product of a pair of rows of A (details in Ref. [7]). Note that S and B are the result of the normal approximation and the use of first- and second-order moments to generate more equations. Vardi then constructs the following augmented matrix equation:

\begin{pmatrix} \bar{Y} \\ S \end{pmatrix} \approx \begin{pmatrix} A \\ B \end{pmatrix} \lambda.

This results in an iterative algorithm for computing λ:

\lambda_k \leftarrow \frac{\lambda_k}{\sum_i a_{ik} + \sum_i b_{ik}} \left( \sum_i \frac{a_{ik} \bar{Y}_i}{\sum_{k'} a_{ik'} \lambda_{k'}} + \sum_i \frac{b_{ik} S_i}{\sum_{k'} b_{ik'} \lambda_{k'}} \right).

Here, λ k , Ȳ i , S i , a ik , b ik are the components of λ, Ȳ, S, A, B, respectively. Once λ has been found, we can normalise with respect to each origin to obtain the probabilities ζ ij .
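The iteration can be sketched in code. This is our reading of Vardi's update, with the first- and second-moment terms weighted together; the function name and the trivial sanity check are ours, not the paper's.

```python
import numpy as np

def vardi_em(A, B, Ybar, S, lam0, iters=200):
    """Iterate Vardi's EM-style update for the Poisson OD means lambda.

    A, Ybar : routing matrix and mean edge counts (first moments).
    B, S    : element-wise row products of A and covariance entries
              (second moments); pass empty arrays to use first moments only.
    """
    lam = np.asarray(lam0, dtype=float).copy()
    A, B = np.asarray(A, float), np.asarray(B, float).reshape(-1, A.shape[1])
    Ybar, S = np.asarray(Ybar, float), np.asarray(S, float)
    denom = A.sum(axis=0) + B.sum(axis=0)
    for _ in range(iters):
        terms = A.T @ (Ybar / (A @ lam))
        if B.size:
            terms = terms + B.T @ (S / (B @ lam))
        lam = lam * terms / denom
    return lam

# Sanity check on a trivially determined system (A = identity, no moments):
# the update converges to lam_k = Ybar_k in a single step.
A = np.eye(3)
lam = vardi_em(A, np.empty((0, 3)), np.array([3.0, 5.0, 2.0]),
               np.empty(0), np.ones(3), iters=1)
print(lam)  # [3. 5. 2.]
```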

E. DNN architecture for bus loop service
For the bus loop service, there are N buses serving the loop of bus stops. Unlike the situation for general complex networks, here the input data comprise the number of people on a bus after it leaves a bus stop, α; and/or the duration the bus spends stopping at a bus stop, β. In principle, we could let each input unit be the number of people on each bus, with the stop duration of each bus as a second feature. Nevertheless, we find that concatenating these time series from all buses into one single long time series per feature empirically boosts performance. This perhaps arises from complex interactions amongst the buses, viz. the number of people picked up by a leading bus affects the number of people left for the trailing buses to pick up.
Hence, for DNNGRU, the input layer comprises α and/or β, where each feature is the concatenation of the time series of all N buses. If only α or β is used, the input layer comprises that single unit with a concatenated time series. The rest is the same as the DNNGRU architecture used for general complex networks, viz. two hidden GRU layers with default activations, each of width 128 units, followed by an output layer of M D units with sigmoid activation. The output is thus ζ ij for each destination j.
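For concreteness, the input shaping described here can be sketched as follows, with hypothetical bus counts, lap lengths, and randomly generated stand-in data (the actual recordings span 10 laps of NTU shuttle measurements):

```python
import numpy as np

N_buses, T_lap = 6, 100                           # hypothetical sizes
rng = np.random.default_rng(1)
alpha = rng.poisson(15, size=(N_buses, T_lap))    # occupancy per bus over time
beta = rng.exponential(30, size=(N_buses, T_lap)) # stop durations per bus

# Concatenate each feature across buses into one long time series, then
# stack the two features: shape (samples=1, timesteps=N*T_lap, features=2),
# the layout that recurrent layers such as GRU expect.
x = np.stack([alpha.reshape(-1), beta.reshape(-1)], axis=-1)[np.newaxis, ...]
print(x.shape)  # (1, 600, 2)
```

If only α is used, the second feature column is simply dropped, leaving shape (1, 600, 1).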
For directly classifying the rank of a destination bus stop, the loss function used is sparse categorical cross entropy, the standard loss function for classification problems with many labels. An argmax is then applied to the output layer, such that the unit with the largest value is deemed the predicted destination bus stop of the stipulated rank. This loss function scores the correctness of the predicted destination bus stop against the actual one. Direct classification differs from directly regressing the values of ζ ij , where the loss function is binary cross entropy measuring the closeness of the predicted ζ ij to the true values. Finally, in the case where dense layers are used for direct classification, the GRU units are simply replaced by regular densely connected feedforward layers. The input layer is then not one single feature, but comprises all values of α, which are fed into two densely connected layers, each of width 128 units, followed by an output layer of M D units.

IV. DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.

I. LINEAR REGRESSION FOR VARDI'S NETWORK
Vardi's network [1] is defined by its routing matrix A. This routing matrix relates the actual numbers of agents in the 12 OD components of X to the number of agents at each of the 7 edges of his network, Y, i.e.

Y = AX.   (2)

The 12 components of X comprise the number of agents going from a to b, a to c, a to d, and so on for every ordered pair of the four nodes a, b, c, d. This view is in fact also explicitly stated in a Bayesian approach that uses Vardi's network in Ref. [2] (see the abstract of that paper). An observer is only given information on Y, and the task is to infer the X that gave rise to the observation. As there are only 7 equations for 12 unknowns in Eq. (2), this is an underdetermined system.
Vardi's EM algorithm has been elaborated in the Methods section of the main text.
Here, we provide the solution by linear regression of the probabilities ζ ij , which is possible if there is additional data: the total number of agents originating from each node, V = (V a , V b , V c , V d )^T. This is reasonably obtainable, as a counter can track how many agents leave each origin node without knowing where they want to go. In the case of a bus system, this simply amounts to tracking how many people board the bus at each bus stop.
With knowledge of V , we can write X in terms of V and the probabilities ζ ij . These probabilities ζ ij are unknown, and now become the variables of interest. Individual data samples of X cannot simply be combined, since different samples have different sets of OD numbers.
Nevertheless, under the assumption that the probabilities are constant during the period of interest, the samples can now be combined to regress for the common ζ ij , with Eq. (2) rewritten in terms of V and ζ ij . We have used the fact that agents from the same origin must go to one of the destinations, so the probabilities from each origin sum up to 1. Consequently, instead of 12 components of ζ ij , we only have 8 independent components, viz. ζ ab , ζ ad , ζ bc , ζ bd , ζ ca , ζ cd , ζ da , ζ dc . Note also that the 7th equation, Y dc = V d , is not useful, so we only have 6 independent equations.
Given measured data of V and Y , these 6 equations can be analytically solved for the 8 independent ζ ij , so that the predicted Ŷ minimises the mean squared error against the actual Y , analogous to Eq. (20). The analytical solution for these optimal ζ ij can be summarised in closed form in terms of an auxiliary quantity σ.

II. OTHER METHODS OF COMPARISON FOR VARDI'S NETWORK: ENTROPY MAXIMISATION AND BAYESIAN INFERENCE
We implement Willumsen's entropy maximisation method [3] to infer the OD trip matrix for Vardi's network, as a direct application of the formula given in Ref. [3]. This turns out to return poor results, with the coefficient of determination peaking around r 2 ∼ 0.039. Clearly, the assumption that the most likely trip matrix is the one with the largest entropy, or equivalently the one minimising the information in the observed data, does not necessarily reflect the way agents behave in general networks.
We also carry out Bayesian inference of the OD: given a uniform prior on the parameters λ and the Poisson likelihood of observing Y given λ, we use Bayes's theorem to obtain the posterior distribution of λ given Y . Several Bayesian inference methods exist in the literature, differing in technical details [2,4,5]. For instance, Ref. [2] implemented Bayesian inference on Vardi's original network in a manner that involves X, but did not compare with Vardi's EM in terms of performance.
Vardi himself also pointed out that their formulation of Bayesian inference is only applicable to one observation of Y , and is not directly generalisable to many samples of Y . Nevertheless, Vardi did suggest a more direct way of implementing Bayesian inference (without the need to involve X) to obtain the posterior distribution of λ given Y , which we adopt here. The details of our implementation are presented in Section III below. The outcome is that the Bayesian model's posterior prediction of the mean value of λ is plagued by large uncertainty.
Recall that Vardi's EM creates more equations using moments, to overcome the underdetermination of the equations, whereas our LR introduced in this paper strives to regress for the common coefficients ζ ij such that all samples of measurements can be combined.
In contrast, the Bayesian inference and entropy maximisation methods do not address the underdetermination problem, and consequently remain burdened by the high uncertainty of their results.

III. BAYESIAN INFERENCE FOR VARDI'S NETWORK
To implement a Bayesian inference on Vardi's network, we seek the probability of the Poisson parameters λ given the observation of m samples Y^(1), ..., Y^(m), i.e. P(λ | Y^(1), ..., Y^(m)). Using Bayes's theorem, we have

P(λ | Y^(1), ..., Y^(m)) ∝ P(Y^(1), ..., Y^(m) | λ) P(λ).

Here, P(λ) = ∏_{k=1}^{12} P(λ_k) is the prior for the parameters λ, whose 12 components correspond to the 12 OD pairs.

The performance of LR sits between those of DNNGRU and EM. Intriguingly, LR is comparable to EM when T is small (i.e. given a small dataset), but asymptotically matches the performance of DNNGRU for large T. Note that DNNGRU is always the best, with the exception of one data point at T = 300 in Fig. 3(g) of the main text when predicting the most popular destination. Whilst DNNGRU can give a prediction even with T = 1, both LR and EM require T ≥ 11, since the entire loop comprises 11 ADE and these methods require the observation of one full passage from O_1 to D_6 in order to calculate the necessary quantities.
Even so, DNNGRU with T = 1 is more accurate than LR and EM with T = 20, thus emphasising its superiority.

The regression coefficients are subject to the constraints of Eq. (14): Σ_{j=1}^{M_D} ζ_ij = 1 for every origin i. Note that since the M_O origins all lie before the M_D destinations in the loop, the number of people alighting at the last destination is the sum of everybody who boarded minus the sum of everybody who alighted before the last destination, i.e. Eq. (15): y_{M_D} = Σ_i x_i − Σ_{j=1}^{M_D−1} y_j. With this and the constraints Eq. (14), we can show that the last component of Eq. (13) (where j = M_D) is not independent of the other components. Consider the right-hand side:

Σ_i ζ_{i M_D} x_i = Σ_i (1 − Σ_{j=1}^{M_D−1} ζ_ij) x_i   [using the constraints Eq. (14)]
                 = Σ_i x_i − Σ_{j=1}^{M_D−1} y_j         [using Eq. (13)]
                 = y_{M_D},                              [using Eq. (15)]   (19)

which is equal to the left-hand side. Thus, only M_D − 1 equations in Eq. (13) are linearly independent.
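This linear dependence is easy to verify numerically; a small sketch with illustrative sizes and randomly drawn row-stochastic probabilities (all numbers here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
M_O = M_D = 6                                   # illustrative sizes
zeta = rng.dirichlet(np.ones(M_D), size=M_O)    # rows sum to 1 (Eq. (14))
x = rng.poisson(10.0, size=M_O).astype(float)   # boardings at each origin
y = zeta.T @ x                                  # alightings via Eq. (13)

# Eq. (15): the last alighting count is fully determined by the others,
# so the j = M_D equation carries no independent information.
assert np.isclose(y[-1], x.sum() - y[:-1].sum())
```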
Eq. (13) is satisfied for every carrier of the agent transfer linking the nodes in a complex network, e.g. a bus going across bus stops. By linear regression, the dataset (x_i, y_j) is fitted so that the predicted ŷ_j according to Eq. (13), as compared to the actual y_j, minimises the mean squared error

J = (1/m) Σ_{data points} Σ_{j=1}^{M_D−1} (ŷ_j − y_j)²,

where m is the number of data points (x_i, y_j). This is a convex minimisation problem, so a global minimum of J exists and can be calculated analytically by setting ∂J/∂ζ_ij = 0, leading to a system of linear equations for all independent components of ζ_ij. This yields the analytical solution for ζ_ij which minimises J; for N = M_O = M_D = 6, the solution is given for j = 1, ..., M_D − 1, with the j = M_D components obtained using the constraints Eq. (14).
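The least-squares fit itself is straightforward to sketch. Below is a toy reconstruction on synthetic, noiseless data; `np.linalg.lstsq` stands in for the analytical normal-equation solution, which it computes:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6                                         # M_O = M_D = N, as in the text
zeta = rng.dirichlet(np.ones(N), size=N)      # true OD probabilities, rows sum to 1
m = 5000                                      # number of (x, y) data points

X = rng.poisson(10.0, size=(m, N)).astype(float)  # boardings per origin
Y = X @ zeta                                      # alightings (noiseless, Eq. (13))

# Regress each of the first M_D - 1 alighting counts on the boardings;
# lstsq solves the normal equations dJ/dzeta_ij = 0 in closed form.
zeta_hat = np.zeros((N, N))
for j in range(N - 1):
    zeta_hat[:, j], *_ = np.linalg.lstsq(X, Y[:, j], rcond=None)
# The j = M_D column follows from the constraint sum_j zeta_ij = 1 (Eq. (14)).
zeta_hat[:, N - 1] = 1.0 - zeta_hat[:, :-1].sum(axis=1)
```

With noiseless data and X of full column rank, the regression recovers ζ exactly; with Poisson-noisy alightings it converges to ζ as m grows, which is the sense in which LR combines all samples of measurements.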

B. Semi-express buses
Then, for the semi-express buses A, B, C, D, E, F, we have the analogous equations for j = 1, ..., M_D − 1. Let σ_A, σ_B, σ_C, σ_D, σ_E, σ_F be M_O × M_O matrices with components defined for each service. The solution follows for j = 1, ..., M_D − 1, with the j = M_D components obtained using the constraints Eq. (14).

VI. ANALYTICAL TREATMENT FOR THE LATTICE
For this lattice, the shortest paths from the respective origin nodes are as depicted in Fig. 4(g) of the main text. Here, X_1 is the number of agents originating from the origin node O_1 at this time step, where X_1 is not known (and varies over time). The number of agents from O_1 going to D_2 is ζ_12 X_1, since ζ_12 is the probability of an agent from O_1 going to D_2. This is the first term on the right-hand side of Eq. (46). The second term, ζ_11 X_1/4, is due to the agents from O_1 having probability ζ_11 of going to destination D_1: amongst these agents, 1/4 of them (on average) take the path O_1 → D_2 → D_1, since one of the four shortest paths is selected randomly. Incidentally, this term ζ_11 X_1/4 is equal to D_2D_1(+1), where "(+1)" means "at the next time step", because all of these agents move onto the edge D_2D_1 at the next time step. Similarly, we obtain five more such equations. Therefore, ζ_12 X_1, ζ_13 X_1, ζ_21 X_2, ζ_23 X_2, ζ_31 X_3, ζ_32 X_3 are given by averages over the measured samples of the number of agents on the respective directed edges at each time step.
Incidentally, by inspection, ζ_11 X_1, ζ_22 X_2 and ζ_33 X_3 are likewise given by averages over the measured samples of the number of agents on the respective directed edges at each time step, where each of the respective four terms corresponds to one of the four possible paths that can be taken.
Recall that X_1, X_2, X_3 are unknown and vary in time. Nevertheless, we can obtain all the ζ_ij by using the constraints Eq. (14), Σ_j ζ_ij = 1, for i = 1, 2, 3. Note that the expression for X_i in Eq. (67) is the average over the samples. In fact, what we actually deduce from the edges via Eqs. (58)-(66) are the quantities Z_ij = ζ_ij X_i, which yield ζ_ij thanks to the constraint Eq. (14). Writing in terms of Z_ij, the probabilities are

ζ_ij = Z_ij / (Z_i1 + Z_i2 + Z_i3),

where we do not need to involve the unknowns X_i at all (c.f. Table 1 of the main text).
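A minimal numerical sketch of this last step (with hypothetical numbers; in practice Z would come from the edge averages of Eqs. (58)-(66)):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical measured quantities Z[i, j] = zeta[i, j] * X[i]; the origin
# volumes X are unknown and never needed.
zeta = rng.dirichlet(np.ones(3), size=3)    # true OD probabilities, rows sum to 1
X = rng.uniform(5.0, 50.0, size=3)          # unknown time-averaged origin volumes
Z = zeta * X[:, None]

# Row-normalising Z cancels the unknown X: zeta_ij = Z_ij / sum_k Z_ik.
zeta_hat = Z / Z.sum(axis=1, keepdims=True)
```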

A. Lattice
The adjacency matrix for the lattice is A_L. The set of shortest paths from O_i to D_j, the set of 38 ADE, and the edges traversed by each origin node O_i in the lattice to reach any destination are enumerated accordingly.

B. Random network

The adjacency matrix for the random network is A_R. The set of shortest paths from O_i to D_j, the set of 32 ADE, and the edges traversed by each origin node O_i in the random network to reach any destination are enumerated accordingly.

C. Small-world network

The adjacency matrix for the small-world network is A_SW. The set of shortest paths from O_i to D_j, the set of 24 ADE, and the edges traversed by each origin node O_i in the small-world network to reach any destination are enumerated accordingly.
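The shortest-path sets used to build these ADE listings can be enumerated programmatically. A minimal sketch, using BFS with backtracking on a toy 4-node graph (not one of the actual networks above):

```python
from collections import deque

def all_shortest_paths(adj, src, dst):
    """All shortest paths from src to dst in the unweighted graph whose
    adjacency matrix is adj (a list of lists of 0/1 entries)."""
    n = len(adj)
    dist = [-1] * n
    dist[src] = 0
    q = deque([src])
    while q:                              # BFS computes shortest distances
        u = q.popleft()
        for v in range(n):
            if adj[u][v] and dist[v] == -1:
                dist[v] = dist[u] + 1
                q.append(v)
    if dist[dst] == -1:
        return []
    def back(v):                          # backtrack along dist-decreasing edges
        if v == src:
            return [[src]]
        return [p + [v] for u in range(n)
                if adj[u][v] and dist[u] == dist[v] - 1
                for p in back(u)]
    return back(dst)

# Toy 2x2 grid (edges 0-1, 0-2, 1-3, 2-3): two shortest paths from 0 to 3.
adj = [[0, 1, 1, 0],
       [1, 0, 0, 1],
       [1, 0, 0, 1],
       [0, 1, 1, 0]]
paths = all_shortest_paths(adj, 0, 3)
print(paths)  # → [[0, 1, 3], [0, 2, 3]]
```

Collecting the union of directed edges over all returned paths for each OD pair gives the kind of ADE sets listed above.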