Single-trajectory map equation

Community detection, the process of identifying module structures in complex systems represented on networks, is an effective tool in various fields of science. The map equation, which is an information-theoretic framework based on the random walk on a network, is a particularly popular community detection method. Despite its outstanding performance in many applications, the inner workings of the map equation have not been thoroughly studied. Herein, we revisit the original formulation of the map equation and address the existence of its “raw form,” which we refer to as the single-trajectory map equation. This raw form sheds light on many details behind the principle of the map equation that are hidden in the steady-state limit of the random walk. Most importantly, the single-trajectory map equation provides a more balanced community structure, naturally reducing the tendency of the overfitting phenomenon in the map equation.


INTRODUCTION
Community detection plays a vital role in various disciplines of science dealing with network data. It is an unsupervised learning task that classifies the node set of a network into groups, or modules. Each subgraph identified as a module has no significant internal structure, while we expect each module to be structurally distinct. Thus, community detection provides a concise explanation of the dataset by ignoring detailed structures within a module as noise while preserving a significant macroscopic organisation as a module structure [1][2][3][4]. Analogous to other machine-learning tasks, because it is often not apparent which parts of the dataset represent noise, community detection algorithms suffer from overfitting and underfitting problems [5].
The map equation [6] is a popular community detection method for networks, formulated as a minimization problem of an information-theoretic objective function describing the average code length of the random walk. Infomap [7], an implementation for a greedy optimisation of the map equation, has often been used to analyze real-world datasets. Furthermore, there are several extensions of the map equation itself [8-15, 17, 18], mainly focusing on incorporating higher-order network information. However, the map equation is prone to overfitting, particularly for sparse networks. Although the map equation provides a module structure that is optimal for the description of the random walk on a network, an excessively fine module structure may be obtained. For example, as illustrated in Fig. 1 (left), many small modules are often identified in addition to a few large modules. When too many modules consisting of only a few nodes are identified, we can hardly interpret the result as a concise explanation of the dataset.
This study revisits the map equation and considers its raw form, which we refer to as the single-trajectory map equation. Its objective function is the average code length of (not necessarily random) walkers with finite path lengths. The concept of the single-trajectory map equation already appears in the original paper [6] as a schematic description of the map equation. Nevertheless, this raw form has never been actively studied or utilised, although it is a valuable variant with a mechanism that prunes small modules and prevents the map equation from overfitting (as depicted in Fig. 1 (right)).
The emergence of small or highly unbalanced modules has been discussed in various contexts in the community detection literature. It is often considered to be the nature of real-world datasets [20][21][22], or a phenomenon that occurs because of the implementation details of an algorithm [21,23]. In the context of the inference problem, the emergence of small modules is interpreted as an artefact caused by overfitting. Optimization-based (or maximum likelihood-based) methods are typically prone to overfitting, whereas methods based on Bayesian formulations avoid partitioning a network or subgraph that has no statistically significant internal structure [24,25]; that is, they avoid generating small modules. The map equation also has a Bayesian counterpart [12,13]. Regardless of the underlying mechanism, the pruning of small modules is sometimes preferred in practice because it provides a more concise explanation of the network. A similar issue can be found in the regression problem in supervised learning [26]. Although ridge regression is a principled method, many variables with extremely small coefficients are often assessed as significant. In contrast, lasso regression prunes such variables and provides a concise description of the dataset.
The single-trajectory map equation is a variant of the map equation that achieves a partition of a coarser resolution scale. Our approach differs from that of the hierarchical map equation [8], which achieves finer-resolution partitions [27]. Although a high-resolution community detection method is useful when the network consists of several small modules, a method with a coarser resolution is also needed when an algorithm suffers from overfitting. For bipartite networks, [17] showed that a coarser resolution can be obtained by incorporating the bipartiteness property. A different resolution scale can also be obtained by introducing the "Markov time" [12,28], which is an external parameter of the random walk. However, as shown in the following, the framework of the map equation can intrinsically prune small modules when formulated as the single-trajectory map equation, and balanced-size modules can be identified in a principled manner.

Fig. 1: Community detection based on the map equation (left) and the single-trajectory map equation (right) applied to a synthetic network. The nodes in the same colour belong to the same module. See Experiments section for details.

Revisiting the map equation
We proceed with the step-by-step formulation of the map equation. A prominent characteristic of the map equation is the hierarchical encoding scheme for the random walk using multiple codebooks, which takes account of the module structure in a network. As a specific example, let us consider the encoding of a trajectory in a network as shown in Fig. 2(a). We let ζ = {ζ_0, . . ., ζ_{T−1}} be a trajectory of a walker, where ζ_t ∈ V is the t-th visited node. We also consider a partition σ = {σ_1, . . ., σ_N} of the node set V (|V| = N), where σ_i ∈ {1, . . ., K} is the module label of node i ∈ V and K is the number of modules. A trajectory of a walker is encoded using two types of codebooks: inter-module and intra-module codebooks. The inter-module codebook describes the transitions of a walker moving into another module. In contrast, each intra-module codebook describes the walker transiting between nodes within the module or exiting the module.
The actual codewords for a trajectory based on the Huffman coding [29,30] are shown in Fig. 2(a). Starting with the code "10" for the module that the walker first visits, the trajectory is described by indicating the visited nodes. Every time the walker moves to a different module, the exiting code of the previous module and the entering code of the next module are consumed.
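To make the codebook construction concrete, the sketch below builds Huffman codeword lengths for one intra-module codebook from hypothetical visit frequencies (only the code lengths matter for the average code length, not the codewords themselves):

```python
import heapq

def huffman_code_lengths(freqs):
    """Return {symbol: codeword length} for a Huffman code built from frequencies."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap entries: (subtree frequency, tie-breaker, {symbol: depth in subtree})
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol one level deeper
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

# One intra-module codebook: three nodes plus the exit codeword (illustrative frequencies)
freqs = {"a": 0.45, "b": 0.25, "c": 0.2, "exit": 0.1}
lengths = huffman_code_lengths(freqs)
avg_len = sum(freqs[s] * lengths[s] for s in freqs)  # 1.85 bits per codeword here
```

Frequently used symbols receive shorter codewords, which is why rarely consuming the inter-module codebook pays off in the overall compression.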
In general, given a trajectory ζ of length T and module assignments σ, the average code length L(ζ, σ) is expressed as

L(ζ, σ) = (1/T) [ Σ_{t=0}^{T−1} ℓ_0(ζ_t, σ_{ζ_t}) + Σ_{t=1}^{T−1} 1[σ_{ζ_{t−1}} ≠ σ_{ζ_t}] ( ℓ_0(exit, σ_{ζ_{t−1}}) + ℓ_1(σ_{ζ_t}) ) + ℓ_1(σ_{ζ_0}) ].    (1)

Here, we denote ℓ_0(i, σ) as the length of the code in an intra-module codebook indicating that a walker visits node i ∈ V in module σ; ℓ_0(exit, σ) as the length of the code in an intra-module codebook indicating that a walker exits module σ; and ℓ_1(σ) as the length of the code in the inter-module codebook indicating that a walker enters module σ. 1[· · ·] denotes the indicator function.
In equation (1), the first summation represents the code length for visited nodes and the second summation represents the code length for transitions between modules. The last term is the code length for the module at the starting point, whose contribution is negligible when T ≫ 1. It is important that the codebooks are coupled; that is, because an exiting code from a module belongs to an intra-module codebook, transitions between modules affect the encoding of transitions within each module.
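A minimal sketch of this bookkeeping in Python, assuming the per-codeword lengths are supplied as functions (their names and signatures are illustrative): the node, exit, and enter codeword lengths are accumulated along the trajectory, plus the codeword for the starting module.

```python
def avg_code_length(zeta, sigma, len_node, len_exit, len_enter):
    """Average code length of a single trajectory zeta under partition sigma.

    len_node(i, s): codeword length for visiting node i in module s
    len_exit(s):    codeword length for exiting module s
    len_enter(s):   codeword length for entering module s
    """
    T = len(zeta)
    total = len_enter(sigma[zeta[0]])  # codeword for the module at the starting point
    for t, i in enumerate(zeta):
        total += len_node(i, sigma[i])
        # A cross-module step consumes an exit and an enter codeword
        if t + 1 < T and sigma[zeta[t + 1]] != sigma[i]:
            total += len_exit(sigma[i]) + len_enter(sigma[zeta[t + 1]])
    return total / T

# Toy check with unit-length codewords: 3 node codewords + 1 exit + 2 enters over T = 3
L = avg_code_length([0, 1, 2], {0: 0, 1: 0, 2: 1},
                    lambda i, s: 1, lambda s: 1, lambda s: 1)  # -> 2.0
```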
The principle of the map equation framework is that the compression of the average code length through the hierarchical coding reveals a module structure as an optimal partition σ. Readers might believe that the introduction of codewords for transitions between modules simply makes the code length longer. However, such a hierarchical encoding scheme can compress the average code length because it allows us to assign shorter codewords for visited nodes; for example, although the code "0" is assigned to two different nodes in Fig. 2(a), they are distinguishable because they belong to different modules. Therefore, when a trajectory rarely consumes the codewords for transitions between modules, the average code length can be compressed more efficiently. Equation (1) can also be expressed using visiting frequencies. We define

p̂_i = (1/T) Σ_{t=0}^{T−1} δ_{i, ζ_t},   p̂_{σσ'} = (1/T) Σ_{t=1}^{T−1} δ_{σ, σ_{ζ_{t−1}}} δ_{σ', σ_{ζ_t}},    (2)

where δ_ab represents the Kronecker delta; p̂_i is the visiting frequency of node i ∈ V and p̂_{σσ'} is the joint transition frequency from module σ to module σ'. Then,

L(ζ, σ) = q̂ H_1 + Σ_σ p̂_σ H_0^σ + (1/T) ℓ_1(σ_{ζ_0}),    (3)

where q̂ = Σ_{σ ≠ σ'} p̂_{σσ'}, p̂_σ = Σ_{σ' ≠ σ} p̂_{σσ'} + Σ_{i∈σ} p̂_i, and Σ_{i∈σ} represents the sum over the node set in module σ. H_0^σ and H_1 are the conditional average code lengths within the intra- and inter-module codebooks, respectively.
Recall that the random walk is a stochastic variable; there is no such thing as a single (finite-length) trajectory representing the random walk. Therefore, instead of a specific trajectory, we consider the expected average code length E_Z[L(Z, σ)] in the map equation, where Z is the stochastic variable representing the random walk; in other words, E_Z[· · ·] is the ensemble average over all possible trajectories (say, from all possible starting points). We assume that the trajectory length T is sufficiently large that the random walk is in a steady state. When the network is strongly connected, the empirical frequencies converge to the corresponding steady-state probabilities.
Σ_{σ'} p̂_{σ'σ} converges to the entering probability of module σ, denoted by q_σ^enter; Σ_{σ'} p̂_{σσ'} to the exiting probability of module σ, denoted by q_σ^exit; and p̂_i to the visiting probability of node i, denoted by q_i. The conditional average code lengths H_0^σ and H_1 also converge to their expectations. According to Shannon's source coding theorem [30,31], these expectations are respectively bounded by the Shannon entropies,

H_1 ≥ −Σ_{σ=1}^{K} (q_σ^enter/q) log(q_σ^enter/q),
H_0^σ ≥ −(q_σ^exit/p_σ) log(q_σ^exit/p_σ) − Σ_{i∈σ} (q_i/p_σ) log(q_i/p_σ),

where q = Σ_{σ=1}^{K} q_σ^enter, p_σ = q_σ^exit + Σ_{i∈σ} q_i, and log is the logarithm with base 2. Then, the expected average code length of the random walk is bounded from below as follows:

L(σ) = −q Σ_{σ=1}^{K} (q_σ^enter/q) log(q_σ^enter/q) − Σ_{σ=1}^{K} [ q_σ^exit log(q_σ^exit/p_σ) + Σ_{i∈σ} q_i log(q_i/p_σ) ].

Note here that the contribution from the starting points of the random walk is excluded. This lower bound asymptotically coincides with the expected average code length itself as T → ∞. This is the objective function of the map equation, and the node partition σ is optimised so that L(σ) is minimised.
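As a numerical sketch (assuming the steady-state quantities q_σ^enter, q_σ^exit, and q_i are given as inputs), the map-equation objective can be evaluated as follows; for a single module with no exits it reduces to the entropy of the visit distribution.

```python
from math import log2

def map_equation(q_enter, q_exit, p_visit, modules):
    """Two-level map equation L(sigma) from steady-state probabilities.

    q_enter[s]: probability of entering module s
    q_exit[s]:  probability of exiting module s
    p_visit[i]: visiting probability of node i
    modules[s]: list of nodes in module s
    """
    def plogp(x):
        return x * log2(x) if x > 0 else 0.0

    q = sum(q_enter.values())
    # Inter-module codebook term: q * H(Q)
    L = -q * sum(plogp(q_enter[s] / q) for s in modules) if q > 0 else 0.0
    # One intra-module codebook per module: p_s * H(P^s)
    for s, nodes in modules.items():
        p_s = q_exit[s] + sum(p_visit[i] for i in nodes)
        terms = [q_exit[s] / p_s] + [p_visit[i] / p_s for i in nodes]
        L -= p_s * sum(plogp(t) for t in terms)
    return L

# Single module, no transitions: L equals the visit entropy (1.5 bits here)
L_one = map_equation({0: 0.0}, {0: 0.0}, {0: 0.5, 1: 0.25, 2: 0.25}, {0: [0, 1, 2]})
```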
The assumption that the network is strongly connected plays a vital role in the aforementioned derivation. If this is not the case, the trajectory length T cannot be sufficiently large, and the contribution from the starting points of the random walk may not be negligible in E_Z[L(Z, σ)]. Therefore, we can say that the map equation evaluates the code length of the "flow." The flow is a stochastic variable representing the ensemble of transitions, and it has no information about the starting points of the random walk by definition (as discussed below, this distinction becomes more prominent when we consider the --flow-model rawdir option in Infomap). The only input for the map equation is a network because the connectivity of nodes fully characterises the flow. By introducing so-called teleportation [32], which moves the walker to a randomly chosen node with a certain probability, we can always let the trajectory length T be infinitely large and make the flow ergodic [6]. Therefore, the map equation is not essentially limited to strongly connected networks.

Single-trajectory map equation
The average code length L(ζ, σ) of a trajectory is the raw form of the objective function in the map equation. When we have multiple trajectories {ζ_a} := {ζ_1, . . ., ζ_M} on a common node set, analogous to the expected average code length E_Z[L(Z, σ)], we consider the following mean average code length:

L(σ; {ζ_a}) = (1/M) Σ_{a=1}^{M} L(ζ_a, σ).    (9)

Each trajectory may have a different length. Similar to L(σ), this mean average code length can be used as an objective function whose minimization determines the optimal module assignments of nodes. We refer to such an optimisation method as the single-trajectory map equation. Note that the trajectories {ζ_a} are provided as inputs in equation (9); unlike the map equation, there is no need to assume that they are generated (or simulated) from random walks, although one can consider simulated walks as trajectories.
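The empirical quantities entering this objective can be gathered in a single pass over the trajectory set; a sketch (the module labels and trajectories below are illustrative):

```python
from collections import Counter

def empirical_frequencies(trajectories, sigma):
    """Node-visit and cross-module transition frequencies, normalised by the
    total trajectory length, plus the starting module of each trajectory."""
    total = sum(len(z) for z in trajectories)
    visits, transitions, starts = Counter(), Counter(), Counter()
    for z in trajectories:
        starts[sigma[z[0]]] += 1
        for t, i in enumerate(z):
            visits[i] += 1
            if t + 1 < len(z) and sigma[z[t + 1]] != sigma[i]:
                transitions[(sigma[i], sigma[z[t + 1]])] += 1
    return ({i: c / total for i, c in visits.items()},
            {k: c / total for k, c in transitions.items()},
            dict(starts))

p_node, p_trans, starts = empirical_frequencies(
    [[0, 1, 2, 0], [2, 3]], {0: 0, 1: 0, 2: 1, 3: 1})
```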
The average code lengths in the summation of equation (9) are not independent because they share codebooks. To illustrate this, let us consider two trajectories as shown in Fig. 2(b). Although the trajectory ζ_1 is identical to ζ in Fig. 2(a), the codes describing them are different because we must assign codewords for the nodes that ζ_1 does not go through, owing to the existence of trajectory ζ_2. In contrast, the nodes that no trajectory goes through do not contribute to the average code lengths, reflecting the fact that trajectories of finite lengths are considered. Those nodes should not have any module labels because there is no information about them in the trajectories.
As we have seen, L(σ) and L(σ; {ζ_a}) are conceptually different. In the map equation, L(σ) is the expected average code length for the flow that is completely specified by the transition probabilities. We can also modify the transitions using teleportation to make the random walk ergodic. By contrast, L(σ; {ζ_a}) does not have such a stochasticity. It is the mean of the actual average code lengths, where each element corresponds to a single trajectory. Furthermore, L(σ; {ζ_a}) depends explicitly on the coding scheme applied, e.g., the Huffman coding, the Shannon-Fano coding [30], etc. Quantitatively, it is mainly the contribution from the last term in equation (1) that makes the minimization of L(σ; {ζ_a}) distinct from that of L(σ). The codeword for the module that is required to specify the starting point of a trajectory makes the coding using multiple codebooks less efficient. Recall that an efficient compression is achieved when the inter-module codebook is not frequently used. This implies that the introduction of module labels is more costly in L(σ; {ζ_a}), and the single-trajectory map equation thus avoids generating many small modules.
The single-trajectory map equation searches for the node partition σ that achieves the optimal compression for the description of trajectories under a certain coding scheme. The optimality of the coding scheme itself is not required for the effectiveness of the method. Therefore, we can use different types of coding for the intra-module and inter-module codebooks. For example, we can introduce a heterogeneous coding where the code lengths are multiplied by a constant factor λ > 0 for the codewords in the inter-module codebook. That is, given a trajectory ζ and a partition σ, equation (1) is modified to

L_λ(ζ, σ) = (1/T) [ Σ_{t=0}^{T−1} ℓ_0(ζ_t, σ_{ζ_t}) + Σ_{t=1}^{T−1} 1[σ_{ζ_{t−1}} ≠ σ_{ζ_t}] ( ℓ_0(exit, σ_{ζ_{t−1}}) + λ ℓ_1(σ_{ζ_t}) ) + λ ℓ_1(σ_{ζ_0}) ].    (10)

Here, λ is a hyperparameter that penalises the emergence of modules when such modules are relatively inefficient for the compression of the code length.
We can also derive a lower bound for the actual code length using Shannon's source coding theorem, similar to how L(σ) was such an estimate for the random walk in the steady-state limit. To this end, we consider the average code length of the concatenated code,

L̄(σ; {ζ_a}) = (1/Σ_{a=1}^{M} T_a) Σ_{a=1}^{M} T_a L(ζ_a, σ),    (11)

where T_a is the length of the a-th trajectory; this is equivalent to L(σ; {ζ_a}) when all trajectories have the same length. We regard the empirical frequencies p̂_i and p̂_{σσ'} as the true probabilities for the stochastic variables indicating the codewords, and the average code length of the concatenated code as the expected code length. Then, the conditional average code lengths in equation (3) are bounded from below by the Shannon entropies [30] with the empirical frequencies. Therefore, the average code length of the concatenated code is bounded as follows:

L̲(σ; {ζ_a}) = −q̂ Σ_σ (q̂_σ^enter/q̂) log(q̂_σ^enter/q̂) − Σ_σ [ q̂_σ^exit log(q̂_σ^exit/p̂_σ) + Σ_{i∈σ} q̂_i log(q̂_i/p̂_σ) ],    (12)

where q̂_i is the visiting frequency of node i over all trajectories, q̂_σ^enter is the frequency with which module σ is entered (including the starting points of the trajectories), q̂_σ^exit is the frequency with which module σ is exited, q̂ = Σ_σ q̂_σ^enter, and p̂_σ = q̂_σ^exit + Σ_{i∈σ} q̂_i; all frequencies are normalised by the total trajectory length Σ_{a=1}^{M} T_a. We regard L̲(σ; {ζ_a}) as an alternative objective function for the single-trajectory map equation. L̲(σ; {ζ_a}) is independent of the coding scheme, and its minimization is computationally more efficient than that of L(σ; {ζ_a}) because we do not need to construct the codebooks explicitly. Note that q̂_σ^enter and q̂_σ^exit may not coincide in equation (12), whereas q_σ^enter = q_σ^exit for any module in L(σ) owing to the detailed balance condition of the random walk in the steady state. Analogous to L_λ(ζ, σ) in equation (10), we can also consider a heterogeneous coding in L̲(σ; {ζ_a}), i.e.,

L̲_λ(σ; {ζ_a}) = −λ q̂ Σ_σ (q̂_σ^enter/q̂) log(q̂_σ^enter/q̂) − Σ_σ [ q̂_σ^exit log(q̂_σ^exit/p̂_σ) + Σ_{i∈σ} q̂_i log(q̂_i/p̂_σ) ].    (13)

Interestingly, the method using L̲(σ; {ζ_a}) is also related to a variant of the map equation that is implemented in Infomap as an option named --flow-model rawdir. In this variant of the map equation, we consider the flow based on the set of transition probabilities induced by the edges (i.e., not the random walk on the network). The corresponding objective function is in fact equivalent to L̲(σ; {ζ_a}) where we ignore the codewords for the initial module and the initial node in each trajectory (consequently, the total length of trajectories Σ_{a=1}^{M} T_a is also modified accordingly). Before moving on, let us compare how
the minimizations of L(σ), L(σ; {ζ_a}), and L̲(σ; {ζ_a}) differ using a simple example. We consider a trajectory where a walker visits each node exactly once on a path. Figure 3 shows the results obtained through the exact minimization of the objective functions. It quantifies how the average code lengths approach a common value as N increases, because the contribution from the starting point of the trajectory becomes negligible. L(σ; ζ) and L̲(σ; ζ) quickly approach each other, whereas L(σ) converges relatively slowly, implying that the contribution from the codeword of the initial module can be considerable. We also confirmed that the single-trajectory map equation indeed tends to identify a smaller number of modules, and the resulting partitions can vary depending on the coding scheme applied.
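The entropy lower bound introduced above can be evaluated directly from a trajectory set. The sketch below treats trajectory starting points as module entries, as described in the Methods section; for a single module, the bound reduces to the entropy of the node-visit distribution.

```python
from math import log2
from collections import Counter

def lower_bound(trajectories, sigma, lam=1.0):
    """Entropy lower bound on the mean average code length (a sketch).

    lam scales the inter-module (entering) term, as in the heterogeneous coding.
    """
    T = sum(len(z) for z in trajectories)
    plogp = lambda x: x * log2(x) if x > 0 else 0.0
    visits, enter, exit_ = Counter(), Counter(), Counter()
    for z in trajectories:
        enter[sigma[z[0]]] += 1  # the starting module counts as an entry
        for t, i in enumerate(z):
            visits[i] += 1
            if t + 1 < len(z) and sigma[z[t + 1]] != sigma[i]:
                exit_[sigma[i]] += 1
                enter[sigma[z[t + 1]]] += 1
    q_i = {i: c / T for i, c in visits.items()}
    q_in = {s: enter[s] / T for s in set(sigma.values())}
    q_out = {s: exit_[s] / T for s in set(sigma.values())}
    p_mod = {s: q_out[s] + sum(q for i, q in q_i.items() if sigma[i] == s)
             for s in q_in}
    q = sum(q_in.values())
    return (lam * (plogp(q) - sum(plogp(x) for x in q_in.values()))
            - sum(plogp(x) for x in q_out.values())
            + sum(plogp(x) for x in p_mod.values())
            - sum(plogp(x) for x in q_i.values()))

# Single module: the bound is the visit entropy (1 bit for a uniform two-node walk)
b = lower_bound([[0, 1, 0, 1]], {0: 0, 1: 0})
```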
The exact minimization of the (expected) average code length is not computationally feasible unless a dataset is extremely small, and thus, we must rely on approximate heuristics in practice. The greedy heuristic implemented in Infomap is commonly used for the map equation. Therefore, we implemented the optimisation for the single-trajectory map equation as a wrapper of Infomap. That is, we first run Infomap to obtain the initial state of the node partition, and then reduce overfitting by pruning small modules based on L(σ; {ζ_a}) or L̲(σ; {ζ_a}) as a fine-tuning process; our fine-tuning algorithm is also a greedy heuristic. In the following, we refer to this algorithm as Infomap+; the implementation code is publicly available [33]. Further details of the algorithm are described in Methods section.

Experiments
This section demonstrates that the single-trajectory map equation prevents overfitting using datasets represented as networks and a real-world dataset given as a set of trajectories. A network is a special case of a trajectory dataset because each directed edge can be regarded as a trajectory of length T = 2. For an undirected network, we treat each edge as a pair of directed edges in both directions. All networks considered are weakly connected.
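This reduction of a network to a trajectory dataset is a one-liner; a sketch:

```python
def edges_to_trajectories(edges, directed=False):
    """Regard each directed edge as a length-2 trajectory; an undirected edge
    yields a pair of trajectories, one in each direction."""
    trajectories = [[i, j] for i, j in edges]
    if not directed:
        trajectories += [[j, i] for i, j in edges]
    return trajectories
```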
For the network datasets, we could also consider simulated walks on the underlying network as the input trajectories. In this setting, we would need to specify the type of simulated walks and choose the values of T and M as hyperparameters. Herein, however, we do not consider simulated walks and treat the edge set directly as the set of trajectories.

Network datasets
We first consider synthetic networks generated by the stochastic block model (SBM) [25,37-39], a random graph model with a planted (pre-assigned) module structure. This is a canonical model used for analyses in community detection. We particularly consider the so-called symmetric SBM, which has two equally-sized planted modules. Each pair of nodes in the same planted module is connected with probability p_in and each pair of nodes in different planted modules is connected with probability p_out. The symmetric SBM is commonly parameterized by the average degree c and the fuzziness of the module structure ε = p_out/p_in instead of p_in and p_out.

Fig. 4: Infomap and the algorithms for the single-trajectory map equation based on L(σ; {ζ_a}) and L̲(σ; {ζ_a}) ("Infomap+") on the symmetric SBM (N = 1,000, c = 12). We generated five instances of the SBM for various values of the fuzziness ε and plotted the distribution of the resulting relative module sizes. Herein, we set λ = 1. The performances of all algorithms change around the algorithmic detectability limit ε ≈ 0.15, which is distinct from the information-theoretic detectability limit located at ε = (√c − 1)/(√c + 1) ≈ 0.55 [34][35][36].
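A sketch of sampling such a network with the standard library only; the conversion from (c, ε) to (p_in, p_out) follows the parameterization above (function and variable names are ours):

```python
import random

def symmetric_sbm(N, c, eps, seed=0):
    """Sample a symmetric SBM: two equal modules, average degree c, eps = p_out/p_in."""
    rng = random.Random(seed)
    # Expected degree: (N/2 - 1) * p_in + (N/2) * p_out = c, with p_out = eps * p_in
    p_in = c / ((N / 2 - 1) + (N / 2) * eps)
    p_out = eps * p_in
    sigma = [0] * (N // 2) + [1] * (N // 2)
    edges = [(i, j) for i in range(N) for j in range(i + 1, N)
             if rng.random() < (p_in if sigma[i] == sigma[j] else p_out)]
    return edges, sigma

edges, sigma = symmetric_sbm(200, 10, 0.1, seed=1)
```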
The detection of planted modules is easier when ε is small because the module structure is clearer. Even when ε < 1, there exists a critical value of ε above which it becomes impossible to identify the planted module structure better than by chance; this is known as the detectability limit [24, 34-36, 39, 40] (in the limit N → ∞). For these networks, the single-trajectory map equation cannot be the best method, as the Bayesian inference methods based on the SBM can avoid overfitting altogether. Figure 4 shows the results of community detection based on the map equation and the single-trajectory map equation applied to the SBM. Each point represents the relative module size, which is defined as Σ_{i=1}^{N} δ_{σ_i, σ}/N for module σ. The results based on modularity maximization (the Louvain [41] and Leiden [42] algorithms) are also shown for comparison. The --two-level option in Infomap indicates the method introduced in the original paper for the map equation.
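The relative module sizes plotted in Fig. 4 are straightforward to compute from a partition; a sketch:

```python
from collections import Counter

def relative_module_sizes(sigma):
    """Relative size of each module: (1/N) * sum_i delta(sigma_i, s)."""
    N = len(sigma)
    return {s: c / N for s, c in Counter(sigma.values()).items()}

sizes = relative_module_sizes({0: "a", 1: "a", 2: "b", 3: "b"})  # -> {"a": 0.5, "b": 0.5}
```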
Infomap (incorrectly) identifies several small modules even when the module structure is relatively clear, whereas Infomap+ prunes such small modules and identifies the equally-sized modules. The network plots in Fig. 1 are the results of the same experiment but with the SBM parameters N = 300, c = 8, and ε = 0.1. Although the modularity-based algorithms also identify small modules, Infomap is more prone to overfitting in the region where ε is small. This phenomenon can be explained by the map equation having a finer resolution limit [27] than that of the modularity [43], i.e., the map equation can identify smaller modules. Note, however, that the analysis of the resolution limit is based on an extreme-case example that has a well-defined module structure; it does not describe the whole behaviour in Fig. 4. In the region of ε above the algorithmic detectability limit (ε ≈ 0.15 [40]), the modularity-based algorithms subdivide the planted modules into a number of smaller modules. This is problematic because a practitioner can hardly tell when the resulting partition is due to overfitting. In contrast, most of the map equation-based algorithms do not partition the network in that region, implying that they avoid overfitting.
Although we showed the relative module sizes obtained by the algorithms, readers might wonder whether the identified modules are actually consistent with the planted ones. In Supplementary Fig. 1 (Section S2), we confirm that the inferred and planted module structures are indeed highly consistent when the number of modules is correctly estimated. Note also that community detection algorithms generally suffer from overfitting and underfitting more severely when the average degree c is smaller. Therefore, all the methods considered here are expected to perform less accurately when c is extremely small.
The experiments here can be conducted for larger networks. In that case, however, some of the plots in Fig. 4 would be unnecessarily difficult to read because we would have many more points due to overfitting. Moreover, the comparison with the previously known result on the detectability limit may be difficult for larger networks because, as observed in [40], the algorithmic detectability limits of greedy algorithms can be size-dependent.
We then apply the algorithms for the single-trajectory map equation to real-world networks. Figure 5 shows the relative module sizes obtained using Infomap and Infomap+. It shows that, with Infomap+, small modules are pruned while larger modules remain identified in most cases. Although all variants of Infomap+ often provide similar partitions, empirically, the Huffman coding method finds a good balance of module sizes in real-world networks. The datasets considered here are often analyzed in the literature on community detection. For example, readers can compare the results here with those of the Bayesian inference methods reported in [5,[44][45][46][47].
In Fig. 5, the value of the hyperparameter λ, which acts as a resolution parameter, is adjusted for each network so that the size of the smallest module is not less than min{3, N/100} (this adjustment can be performed automatically). The selected values of λ and the details of the experimental settings and datasets are provided in Supplementary Table 2 (Section S3). We also examined the λ-dependency in Fig. 6 and found that the number of modules varies within 1 ≤ λ < 2 in many datasets; in the Methods section, we show that λ = 2 is a practical upper bound according to the resolution limit. Note that the threshold min{3, N/100} is only a reference for determining a reasonable value of λ; when Infomap+ excessively prunes modules, one can directly tune λ to resolve the underfitting problem.
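The automatic adjustment can be sketched as a simple loop; here `detect` stands for a hypothetical callable wrapping Infomap+ that returns a partition {node: module} for a given λ, and λ = 2 is used as the practical upper bound mentioned above.

```python
from collections import Counter

def tune_lambda(detect, N, lam=1.0, lam_max=2.0, step=0.1):
    """Raise lambda until the smallest module holds at least min(3, N/100) nodes."""
    threshold = min(3, N / 100)
    while lam <= lam_max:
        partition = detect(lam)
        if min(Counter(partition.values()).values()) >= threshold:
            return lam, partition
        lam += step
    return lam_max, detect(lam_max)

# Toy detector: a size-1 module survives until lambda reaches 1.5
detect = lambda lam: {0: 0, 1: 0, 2: 1} if lam < 1.5 else {0: 0, 1: 0, 2: 0}
lam, partition = tune_lambda(detect, 300)
```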

Bike-sharing dataset: Application to a set of trajectories
Finally, we compare the methods using a dataset of a bike-sharing service in London [11,48], which consists of the trajectories (sequences of visited bike stations) that individual bikes have travelled in a day (see Supplementary Information (Section S4) for the details of the dataset); thus, T_a is the number of stations that bike a has visited. Figure 7(a) illustrates the trajectories of three bikes in the dataset. Community detection on the trajectories identifies the area within which a bike is often used. Figure 7(b) shows the partition obtained by minimising L(σ) using Infomap; here, we constructed a network by decomposing each trajectory into a set of edges between successive pairs of stations. As a result, we obtained eight modules; in addition to four large modules, several modules consisting of only a few stations are identified. In Fig. 7(c), which shows the partition obtained by minimising L(σ; {ζ_a}), we no longer observe the small modules. Although Fig. 7(c) is the result of the Huffman coding method (λ = 1), we obtain the same partition with the Shannon-Fano coding method (λ = 1) and with the method minimising L̲(σ; {ζ_a}) (λ = 1.8).

DISCUSSION
This study revisited the formulation of the map equation and shed light on many details hidden in its principle.We addressed the fact that the encoding of trajectories is qualitatively distinct from the encoding of the flow on a network and proposed the single-trajectory map equation.Importantly, the proposed method can prune small modules and prevent overfitting.
The single-trajectory map equation provides a more balanced community structure compared with the map equation. Although balanced partitions may not always be desirable, they are often beneficial because we can prune spurious modules caused by overfitting, as demonstrated in Fig. 1. Furthermore, the analysis in the Methods section implies that the single-trajectory map equation is not prone to underfitting compared with the map equation because their resolution limits are almost the same when λ = 1. Readers might wonder whether the present approach is distinct from other variants of the map equation, such as the one with the Markov-time parameter [12,28] and the Bayesian formulation of the map equation [12,13], which is an improved teleportation method [50]. To clarify this point, we also conducted experiments analogous to those described in Experiments section using these methods in Supplementary Information (Section S5). In some cases, these methods also exhibit partitions similar to those in Figs. 4 and 5. However, they are apparently not particularly suitable for pruning small modules: their hyperparameters must be searched on a finer scale or over a wider range to find the optimal values, whereas balanced partitions are often obtained without tuning the hyperparameter λ in the single-trajectory map equation. We also emphasise that the single-trajectory map equation is not a generalisation of the map equation but its raw form, and overfitting is avoided using the principle of the map equation itself.
The bootstrapping method [51] is another approach for avoiding overfitting. However, this approach is computationally expensive [13], and a comparison with the present approach is not very clear because its output is a population of partitions. A more detailed study of the qualitative and quantitative relationships between the single-trajectory map equation and other variants of the map equation is left for future work. Furthermore, because the single-trajectory map equation is a trajectory-based approach, the relationship with the memory-network extension [10,13] of the map equation is another potential research direction because both take a set of trajectories as the input.
The time complexity of the optimisation algorithm is a major issue in the single-trajectory map equation. Whereas the lower bound L̲(σ; {ζ_a}) can be optimised as efficiently as in Infomap, the explicit construction of the codebooks is required for the actual average code length L(σ; {ζ_a}). Although our implementations of Infomap+ run within a reasonable amount of time for fairly large datasets, as demonstrated in Supplementary Fig. 6, an improved implementation is also left for future work.

Optimisation algorithm
Herein, we explain the implementation details of the greedy heuristic. A typical greedy heuristic for community detection, including Infomap, iteratively merges two or more modules that improve the value of an objective function [41,42,52], and equally-sized modules are often preferentially merged [23]. Such an update rule does not effectively compress the average code length at the fine-tuning stage. This is not surprising because the initial partition is located at a local or global minimum of the objective function in the map equation, which may also be a local minimum in the single-trajectory map equation. Moreover, there is no reason that equally-sized modules should be preferentially merged. Although we typically have a few large and many small modules in the initial partition, it is unlikely that merging those small modules provides a better compression of the average code length. Therefore, we instead iteratively merge the smallest module and its most tightly-connected module, regardless of the resulting value of the average code length, until only one module is left; among the partitions that arise during this merging process, we accept the partition that achieves the minimum average code length. Given an initial partition, this algorithm is deterministic. Although this algorithm is straightforward and the resulting partition may not be the global optimum, an improved compression of the average code length can be achieved by pruning small modules without being trapped in local minima of the objective function.
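The merging loop described above can be sketched as follows; `objective` is any (average code length) function over partitions, and a toy objective is used in the check (names are ours):

```python
from collections import Counter

def prune_small_modules(sigma, edges, objective):
    """Repeatedly merge the smallest module into its most tightly connected
    module until one module remains; return the best partition encountered."""
    sigma = dict(sigma)
    best, best_val = dict(sigma), objective(sigma)
    while len(set(sigma.values())) > 1:
        sizes = Counter(sigma.values())
        smallest = min(sizes, key=lambda s: sizes[s])
        # Count edges between the smallest module and every other module
        links = Counter()
        for i, j in edges:
            si, sj = sigma[i], sigma[j]
            if si != sj and smallest in (si, sj):
                links[sj if si == smallest else si] += 1
        if links:
            target = max(links, key=lambda s: links[s])
        else:  # disconnected module: fall back to the next smallest module
            target = min((s for s in sizes if s != smallest), key=lambda s: sizes[s])
        sigma = {i: (target if s == smallest else s) for i, s in sigma.items()}
        val = objective(sigma)
        if val < best_val:
            best, best_val = dict(sigma), val
    return best

# Toy check: two triangles joined by one edge; the objective prefers two modules
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
best = prune_small_modules({0: 0, 1: 0, 2: 1, 3: 2, 4: 2, 5: 2}, edges,
                           lambda s: abs(len(set(s.values())) - 2))
```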
When we use the lower bound L̲_λ(σ; {ζ_a}) as the objective function, the greedy update can be performed as in Infomap. The expanded form of L̲_λ(σ; {ζ_a}) is

L̲_λ(σ; {ζ_a}) = λ [ q̂ log q̂ − Σ_σ q̂_σ^enter log q̂_σ^enter ] − Σ_σ q̂_σ^exit log q̂_σ^exit + Σ_σ p̂_σ log p̂_σ − Σ_i q̂_i log q̂_i.    (14)

The last term in equation (14) is independent of the partition. Therefore, when we merge two modules, we only need to keep track of changes in q̂_σ^enter, q̂_σ^exit, and p̂_σ, which are defined in equation (12). In these quantities, Σ_{a=1}^{M} δ_{σ, σ_{ζ_{a,0}}} is the population of the starting-point nodes in module σ, the sums of the cross-module transition counts into and out of σ are the populations of the transitions across modules in the set of trajectories, and Σ_{i∈σ} q̂_i is the sum of the node-visiting frequencies in module σ. These are O(K) quantities, such that we can efficiently compute the change in L̲_λ(σ; {ζ_a}) when two modules are merged.
When we use the actual average code length L_λ(σ; {ζ_a}) as the objective function, the greedy update cannot be computed as efficiently as for the lower bound. When two modules are merged, we need to reconstruct the intra-module codebook of the target module, as well as the inter-module codebook, to compute the updated code length. The time complexity of constructing a codebook depends on the specific coding scheme applied. In Supplementary Fig. 6, we show the running times of Infomap and Infomap+ on the SBM; here, we used the Infomap API [53] (a C++-based implementation with a Python wrapper) for Infomap and our Python-based implementation [33] for Infomap+.
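To illustrate the codebook-dependent cost, the following sketch computes the average codeword length of a Huffman code from a frequency table. This is standard textbook Huffman coding, not the authors' implementation; it shows why a codebook must be rebuilt from scratch whenever its frequency table changes, e.g., after a merge:

```python
import heapq
from itertools import count

def huffman_avg_length(freqs):
    """Average codeword length (in bits) of a Huffman code built from a
    frequency table {symbol: probability}."""
    if len(freqs) == 1:
        return 1.0  # a lone codeword still takes one bit
    tie = count()   # tie-breaker so the heap never compares symbol lists
    heap = [(p, next(tie), [s]) for s, p in freqs.items()]
    heapq.heapify(heap)
    length = {s: 0 for s in freqs}
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:      # each symbol in the merged subtree
            length[s] += 1     # descends one level deeper in the code tree
        heapq.heappush(heap, (p1 + p2, next(tie), s1 + s2))
    return sum(freqs[s] * length[s] for s in freqs)
```

Because the codeword lengths depend jointly on all frequencies in a codebook, merging two modules invalidates the target module's intra-module codebook and the inter-module codebook, and both must be reconstructed before the updated code length can be evaluated.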

Resolution limit
Readers might suspect that the pruning effect implies that the proposed method is prone to underfitting. To examine this issue, we derive the resolution limit of the single-trajectory map equation, focusing on L_λ(σ; {ζ_a}) and network datasets. The resolution limit is the smallest module size that the method can identify given a network size, such as the total number of edges.
The following analysis shows that, although the method has a coarser resolution scale than the standard or hierarchical map equation, it is still a high-resolution method. The analysis also provides a theoretical explanation of some of the empirical results obtained in the Experiments section, as well as an implication for the range that the hyperparameter λ should take.

General form
We closely follow the derivation in [27], which applies to undirected networks. The present resolution limit is for directed networks, and the objective function considered is L_λ(σ; {ζ_a}) instead of L(σ).
We first rewrite the empirical frequencies of the walkers and the objective function of the single-trajectory map equation in terms of network statistics. When the input trajectories are the edges of a directed network (i.e., the number of trajectories M is the number of directed edges), we obtain equation (15), where ℓ_σ is the number of directed edges within module σ; in_σ and out_σ are the numbers of in-coming and out-going edges of module σ, respectively; d_i^in and d_i^out are the in- and out-degrees of node i; and C is the cut size of the network, i.e., the total number of directed edges crossing different modules. Using equation (15), the objective function is recast as equation (16).
In the resolution-limit analysis, we consider two well-defined modules and derive the condition under which merging them is favoured (i.e., the modules are not resolved) for better optimisation of the objective function. Thus, we evaluate the condition under which the difference in the objective function, ΔL_λ, becomes negative when the two modules are merged. We denote the labels of the two well-defined modules as A and B and the merged module as AB. We also denote by R the change in the module-wise sum Σ_σ L_σ^λ through the update. Here, R is a local quantity that depends only on variables within and around modules A and B. When two well-defined modules are merged, the cut size decreases by a small δ (δ ≪ M + C). The difference in the objective function under the update is given in equation (19), where e is the base of the natural logarithm. Therefore, the resolution limit is generally expressed as equation (20). In the map equation, the cut size C is the only global term responsible for the resolution limit (see equation (11) in [27]). By contrast, the single-trajectory map equation has the total number of directed edges M as another global term in equation (20). Note, however, that the contribution from M is logarithmic, implying that the single-trajectory map equation is still a high-resolution method. Next, we derive a more explicit scaling.

Ring of cliques
It is common to consider a "ring of cliques" in a resolution-limit analysis, as illustrated in Fig. 8(a). We consider m cliques, each consisting of n nodes, and connect them with single edges to form a ring. This is an undirected network; again, we treat each undirected edge as a pair of directed edges in both directions. We regard each clique as a module and, using this example, derive the resolution limit in a more explicit form.
When we merge two of these cliques, the cut size decreases by 2. We denote by ℓ_σ = n(n − 1) the number of directed edges within an arbitrary module (clique). Assuming that n(n − 1) ≫ 1, we obtain equation (21). Substituting equation (21) into equation (20), we obtain equation (22). Each clique is resolved as a module unless (M + C)^λ is larger than the left-hand side of equation (22), which is an exponentially growing function of the clique size. Figure 8(b) depicts the resolution limits of the single-trajectory map equation, together with those of the map equation [27] and modularity [43]. Although n and m are integers, we treat them as real numbers to highlight the scaling of each resolution limit. The resolution limit with λ = 1 is extremely close to that of the map equation. Therefore, the single-trajectory map equation is not prone to underfitting compared with the map equation.
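The edge statistics used in this example can be checked numerically. The sketch below assumes, as above, that each ring edge is a single undirected edge between adjacent cliques and that every undirected edge is counted as two directed edges:

```python
def ring_of_cliques_stats(n, m):
    """Directed-edge statistics of a ring of m cliques of n nodes each,
    treating each undirected edge as a pair of directed edges."""
    intra = n * (n - 1)       # directed edges inside one clique
    M = m * intra + 2 * m     # total: all cliques plus the m ring edges
    C = 2 * m                 # cut size when each clique is one module
    return M, C

# Merging two adjacent cliques removes one ring edge from the cut,
# i.e., the directed cut size decreases from C to C - 2.
```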
When λ is large, modules with small n are not resolved for any network size. However, this limit rapidly disappears as n becomes larger, whereas the resolution limit of the modularity disappears relatively slowly. This dependence of the resolution limit partially explains the favourable behaviour of the single-trajectory map equation: small modules are pruned, yet large modules continue to be identified. However, as pointed out in the main text, the resolution limit does not describe the full behaviour of the single-trajectory map equation; it is not λ that plays the critical role in the method, and λ = 1 is often sufficient to avoid overfitting.
On the left-hand side of equation (22), the leading coefficient in the exponent becomes negative at λ = 2. In this case, a clique will not be resolved as a module for any network size, regardless of its size n; i.e., the ability to detect communities is completely lost. This transition implies that the optimal value of λ is usually located within 1 ≤ λ < 2, which is indeed consistent with our experimental results in Fig. 6.

S1. SUMMARY OF THE AVERAGE CODE LENGTHS
In Table I, we show a summary of the average code lengths considered in this study.

For the experiment conducted on the SBM in the main text, we examined whether Infomap+ can correctly estimate the planted number of modules. Readers might doubt whether the partitions obtained by Infomap+ are consistent with the planted module structure even when the number of modules is accurately estimated. To clarify this point, we conducted the same experiment on the SBM and measured the fraction of correctly classified nodes, which is defined as max{ (1/N) Σ_{i=1}^N δ_{σ_i, σ*_i}, (1/N) Σ_{i=1}^N (1 − δ_{σ_i, σ*_i}) }, where σ_i ∈ {1, 2} is the inferred module label and σ*_i ∈ {1, 2} is the planted module label of node i. Note that (1/N) Σ_{i=1}^N (1 − δ_{σ_i, σ*_i}) = 1 indicates that the algorithm perfectly inferred the planted module structure, but with the opposite module label for each node. The value of Eq. (S1) therefore ranges from 0.5 to 1. Figure S1 shows that, when Infomap+ correctly estimates the planted number of modules (K = 2), the fraction of correctly classified nodes is indeed high. (We generated 10 instances of the SBM for each fuzziness of the module structure and omitted the partitions with K = 2; again, we set λ = 1.)
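The label-swap-invariant accuracy for K = 2 described above can be sketched as follows; this is a hypothetical helper consistent with the 0.5-to-1 range noted in the text, not the authors' code:

```python
def fraction_correct(labels, planted):
    """Fraction of correctly classified nodes for K = 2 modules,
    invariant to swapping the two module labels.

    labels:  inferred module labels, one per node
    planted: planted module labels, one per node
    Returns a value in [0.5, 1.0].
    """
    n = len(labels)
    agree = sum(1 for a, b in zip(labels, planted) if a == b) / n
    # a partition with every label flipped is equally correct
    return max(agree, 1.0 - agree)
```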

S3. DETAILS OF THE EXPERIMENTS ON REAL-WORLD NETWORK DATASETS
This section describes the details of the real-world networks analysed in the main text and the settings of the algorithms applied. Table II shows the type, number of nodes, and number of edges of each dataset (see the references for descriptions of the datasets). Table II also lists the value of the hyperparameter λ used in the single-trajectory map equation. Although the Les Miserables network was originally distributed as a weighted network, we converted it into a network with multiple edges because the edge weight represents the number of scene co-appearances of characters.
Recall that a large value of λ penalises the generation of new modules. Therefore, starting with λ = 1, we increased the value of λ little by little (here, in steps of 0.1) until the size of the smallest module became sufficiently large. As far as we have investigated, λ = 1 is often already sufficient.
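This λ sweep can be sketched as a simple loop. Here `detect` is a hypothetical stand-in for one optimisation run of the single-trajectory map equation, and the stopping threshold `min_size` is a free parameter (the text only requires the smallest module to be "sufficiently large"):

```python
def tune_lambda(detect, min_size, lam0=1.0, step=0.1, lam_max=2.0):
    """Increase lambda in fixed steps until the smallest identified
    module has at least `min_size` nodes.

    detect: function lambda_value -> list of modules (lists of nodes)
    """
    lam = lam0
    while lam <= lam_max:
        modules = detect(lam)
        if min(len(m) for m in modules) >= min_size:
            return lam, modules
        lam = round(lam + step, 10)  # avoid floating-point drift
    return lam, detect(lam)
```

The cap lam_max = 2.0 reflects the transition discussed in the resolution-limit analysis, beyond which community detection ability is lost.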
As options in Infomap, we used --two-level for the undirected networks and --two-level --directed for the directed networks (these correspond to the methods introduced in the original paper on the map equation). The comparative analysis is fairer and more nontrivial with these options than with the --flow-model rawdir option, because the distinction between the flow-based and trajectory-based methods becomes more prominent. We employed the --two-level constraint because the evaluation of multilevel partitioning is beyond the scope of the present study.
We can obtain a smaller module or a smaller number of modules using Infomap+ than when using Infomap. This is because Infomap returns different results with different options (recall that we use --flow-model rawdir for the initialisation in Infomap+) or across different runs.

S4. BIKE-SHARING DATASET
The bike-sharing dataset analysed in the main text was constructed from the dataset distributed through [S11]. The original dataset consists of riding records of a bike-sharing service in London during 2014; for each use (travel),

Figures S4 and S5 show the performance of the Bayesian Infomap [S12, S13] on the SBM and real-world networks, corresponding to the experiments performed in the main text. The Bayesian Infomap has a hyperparameter λ that specifies the strength of the prior distribution based on a random network. The default value is λ = (ln N)/N; in Infomap, the parameter regularisation strength controls the coefficient a in λ = a(ln N)/N.
Overall, the Bayesian Infomap is highly sensitive to the choice of λ. As observed in Fig. S4, in many cases, the Bayesian Infomap either leaves many small modules or identifies the whole network as a single module. The same tendency was observed for the real-world networks, as shown in Fig. S5. Similar to the experiment in the main text, we increased
the value of λ from zero (in steps of 0.1) until the size of the smallest module was not less than max{3, N/100}. As a result, the Bayesian Infomap did not identify nontrivial modules for the large networks. We also conducted a version of the experiment in which the threshold for the smallest module was max{3, N/1,000}; however, the same number of modules was obtained for each dataset. In summary, although the Bayesian Infomap also aims to avoid overfitting, its performance is distinct from that of the single-trajectory map equation. For the datasets we investigated, it was not easy to prune small modules while continuing to identify large modules. However, it should also be noted that the Bayesian Infomap is a highly flexible method whose performance can be improved by tuning the prior distribution more carefully.

FIG. 2. Trajectories (yellow solid lines) on a network and their encoding for a given node partition. Nodes in the same module have the same symbol and colour representations. The codewords in each codebook are listed on the right. Whereas there is only one trajectory in (a), another trajectory is added in (b). The average code length of each trajectory is shown at the bottom.
FIG. 3. Average code lengths for a trajectory on a path (illustrated at the top) obtained by minimising L(σ; ζ) ("Huffman": Huffman coding; "Shannon-Fano": Shannon-Fano coding) and its lower bound ("lower bound"). The expected average code length L(σ) ("map equation") based on the set of transition probabilities induced by the edges, i.e., the --flow-model rawdir option, is also shown. The number of detected modules for each method is indicated at the top of each bar.

FIG. 5. Relative module sizes obtained by Infomap and Infomap+ for real-world networks. The number of identified modules is depicted at the top of each result. The dashed line represents 0.01.

FIG. 6. Number of modules identified for each value of the hyperparameter λ in Infomap+.

FIG. 7. Community detection of the bike-sharing dataset. (a) Three trajectories in the dataset, where each point (node) represents the location of a bike station. The partitions of the stations are obtained by minimising (b) L(σ) (8 modules detected) and (c) L(σ; {ζ_a}) based on the Huffman coding (4 modules detected). Stations in the same module are shown in the same colour.

FIG. 8. Ring of cliques and its resolution limit. (a) Network plot with n = 5 and m = 8, and (b) the resolution limits of the map equation (light blue line), the single-trajectory map equation with λ = 1, 1.25, 1.5 (dark blue lines), and the modularity (dashed grey line). Each line represents the phase boundary above which a clique is not resolved as a module because the network is too large (e.g., the marked region represents the undetectable region of the map equation).
TABLE I. Symbols and descriptions for the average code lengths considered in the map equation and the single-trajectory map equation.

Symbol | Description
L(ζ, σ) | Average code length of trajectory ζ with node partition σ
L_λ(σ; {ζ_a}) | Average code length of trajectories {ζ_a} with node partition σ and hyperparameter λ; an objective function of the single-trajectory map equation; L(σ; {ζ_a}) = L_{λ=1}(σ; {ζ_a})
L_λ(σ; {ζ_a}) [lower bound] | Lower bound of the average code length of trajectories {ζ_a} with node partition σ and hyperparameter λ; an objective function of the single-trajectory map equation; L(σ; {ζ_a}) = L_{λ=1}(σ; {ζ_a})
L(σ) (--two-level option in Infomap) | Expected average code length of the random walk with node partition σ; the objective function of the map equation mainly considered in the original paper
L(σ) (--flow-model rawdir option in Infomap) | Expected average code length of the flow based on the set of transition probabilities induced by the edges under node partition σ; a variant objective function of the map equation implemented in Infomap

S2. ACCURACY OF INFOMAP+ ON THE SBM

FIG. S3. The number of identified modules and the value of the Markov-time parameter τ are depicted at the top of each result. We selected the minimum value of τ such that the smallest module is not less than max{3, N/100} (the dashed line represents 0.01); otherwise, we set τ = 100 (denoted "τ > 100").
FIG. S4. Performance of the Bayesian Infomap with different values of the prior parameter λ on the symmetric SBM (N = 1,000, c = 12). We generated five instances of the SBM for each fuzziness of the module structure and plotted the distribution of the resulting relative module sizes.
FIG. S6. Running times of Infomap (--two-level) and Infomap+ on the SBM. Each point represents the mean running time over five network instances generated from the SBM with eight equally-sized planted modules (c = 12, = 0.1). The running time of each algorithm grows polynomially with N.
A summary table of the average code lengths is shown in Supplementary Table 1 (Section S1).