Mapping flows on hypergraphs

Hypergraphs offer an explicit formalism to describe multibody interactions in complex systems. To connect dynamics and function in systems with these higher-order interactions, network scientists have generalised random-walk models to hypergraphs and studied the multibody effects on flow-based centrality measures. But mapping the large-scale structure of those flows requires effective community detection methods. We derive unipartite, bipartite, and multilayer network representations of hypergraph flows and explore how they and the underlying random-walk model change the number, size, depth, and overlap of identified multilevel communities. These results help researchers choose the appropriate modelling approach when mapping flows on hypergraphs.

Researchers model and map flows on networks to identify important nodes and detect significant communities 1,2,3,4. From small to large system scales, random walk-based methods help to uncover the inner workings of the systems the networks represent 5,6. When standard network models fail to adequately represent a system's interactions, researchers turn to higher-order models of complex systems 7,8, including multilayer networks 9,10,11 for multitype interactions, non-Markovian networks 12,13,14 for multistep interactions, and combinatorial models such as simplicial complexes 15,16,17,18 and hypergraphs 19,20,21,22 with nodes in hyperedges for multibody interactions.
While several methods can identify flow-based communities in multilayer 9,23,24 and memory 12,13,14 networks with non-Markovian dynamics, researchers have just begun to unravel the large-scale systemic effects of multibody interactions captured by hypergraphs 22. However, different systems and research questions call for different random-walk and hypergraph models: random walks can be lazy, able to visit the same node multiple times in a row, or non-lazy and forced to move on. Hyperedges can have arbitrary weights, and nodes can have hyperedge-dependent weights. Because these and other models can be represented with different network types (bipartite, unipartite, and multilayer), the questions multiply: How do different hypergraph random-walk models combined with different network representations change the flow dynamics at scales captured by communities?
For example, random walks on hypergraphs can model the flow of ideas in co-authorship networks. A node represents an author, and a hyperedge connects all authors of a paper. In the simplest dynamics, a random walker on a node picks a random hyperedge among those that contain the node and steps to a random node of the picked hyperedge. Then it repeats. Excluding author self-links for non-lazy walks, including hyperedge weights from paper citations, or using hyperedge-dependent node weights for varying author contributions are natural model variations that generate different dynamics 20,21. How does the organisation of authors in nested communities, from research groups to research areas, change with random-walk model and representation?
* anton.eriksson@umu.se
For lazy random walks on hypergraphs with self-links and hyperedge-independent node weights, random walks on weighted, undirected networks generate equivalent dynamics 20. Each hyperedge becomes a clique with properly adjusted link weights. This projection enables standard flow-based methods developed for weighted networks to identify communities where random walks stay for a long time. Non-lazy walks or walks with hyperedge-dependent node weights require directed networks 20. A bipartite representation provides hyperedge assignments, and a multilayer representation enables overlapping communities.
Representing hypergraphs with bipartite networks requires weighted, directed links between two sets of nodes: one for the nodes and one for the hyperedges. Picking a random hyperedge becomes an explicit step to a hyperedge node. Non-lazy walks on the hypergraph require non-backtracking walks on the bipartite network 25. With proper normalisation, the node-visit rates stay the same. Though unipartite and bipartite representations give identical node flows, the bipartite representation's link flows from nodes to hyperedge nodes and back to nodes can induce more flows between communities and alter the optimal community composition. The community-detection algorithm must also assign more nodes, which implies more degrees of freedom and a larger search space.
Multilayer networks represent the hyperedges as layers with fully connected groups of nodes. Each node is present in each of its hyperedge layers. Hyperedge weights become layer weights, and hyperedge-dependent node weights become layer-dependent node weights. Though the node-visit rates aggregated over layers remain the same, multilayer networks multiply the degrees of freedom and enable new models. Reducing the inter-layer link weights increases the time a random walker spends within a hyperedge before moving to another. Reducing the inter-layer link weights only between dissimilar layers reinforces flows within similar layers. The search space expands when nodes can belong to multiple overlapping communities. The many combinations of random-walk models and representations available to address specific research problems require us to ask, for different data and different questions, which model and representation is best?
To address which combination of model and representation is best for answering different questions about various hypergraph data, we derive unipartite, bipartite, and multilayer network representations of hypergraph flows with identical node-visit rates for the same random-walk model. For unique node-visit rates when a representation requires directed links, we apply an unrecorded teleportation scheme that is robust to changes in the teleportation rate and preserves the node-visit rates when teleportation is superfluous in undirected networks 26. The information-theoretic and flow-based community detection method Infomap 27 allows us to explore how different hypergraph random-walk models and network representations change the number, size, depth, and overlap of identified multilevel communities.
By analysing schematic and real hypergraphs, we find that the bipartite network representation requires the fewest links and enables the fastest community detection. A multilayer network representation that reinforces flows within similar layers gives the deepest modular structures with the most overlapping communities but at a high computational cost. The unipartite network representation provides a trade-off between the two, with intermediate compactness, speed, and detectable modular regularities.

Results and Discussion
Modelling flows on hypergraphs. We model flows on hypergraphs with random walks, using hypergraphs with nodes $V$, hyperedges $E$ with weights $\omega$, and hyperedge-dependent node weights $\gamma$. Each hyperedge $e$ has a weight $\omega(e)$. Each node $u$ with incident hyperedges $E(u) = \{e \in E : u \in e\}$ has a weight $\gamma_e(u)$ for each incident hyperedge $e$. To simplify the notation when normalising weights into probabilities, we denote node $u$'s total incident hyperedge weight $d(u) = \sum_{e \in E(u)} \omega(e)$ and hyperedge $e$'s total node weight $\delta(e) = \sum_{u \in e} \gamma_e(u)$ 20. With these weights, a lazy random walker moves from node $u$ at time $t$ to node $v$ at time $t + 1$ in three steps by (i) picking a hyperedge $e$ among node $u$'s incident hyperedges $E(u)$ with probability $\omega(e)/d(u)$, (ii) picking one of hyperedge $e$'s nodes $v$ with probability $\gamma_e(v)/\delta(e)$, and (iii) moving to node $v$.
Variations of this model include non-lazy walks, which never visit the same node twice in a row 20, and teleporting walks, which jump to a random node at some rate to ensure that all nodes can be reached from any node in a finite number of moves, so-called ergodic walks. We pick the next hyperedge based on its similarity to the previously picked hyperedge in hyperedge-similarity walks, which are useful for modelling flows that tend to stay among similar hyperedges, such as among research papers with similar author lists and likely similar topics. These walks require memory and correspond to a higher-order Markov chain model because they depend on the previously picked hyperedge.
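As a minimal sketch of the lazy walk described above, the following Python snippet computes the one-step probabilities from a node by first picking an incident hyperedge in proportion to its weight and then picking one of that hyperedge's nodes in proportion to the hyperedge-dependent node weights. The toy hypergraph, the dictionary layout, and the function name are illustrative assumptions and are not taken from the data analysed in this paper.

```python
from fractions import Fraction

# Hypothetical toy hypergraph:
# hyperedge -> (hyperedge weight, {node: node weight within that hyperedge})
HYPEREDGES = {
    "e1": (Fraction(1), {"a": 1, "b": 1, "c": 2}),
    "e2": (Fraction(3), {"c": 1, "d": 1}),
}


def lazy_step_probabilities(u, hyperedges):
    """Return {v: probability of moving from u to v} for one lazy step."""
    incident = {e: (w, g) for e, (w, g) in hyperedges.items() if u in g}
    # Total incident hyperedge weight of u.
    d_u = sum(w for w, _ in incident.values())
    probs = {}
    for w, gamma in incident.values():
        delta_e = sum(gamma.values())  # total node weight of the hyperedge
        for v, g_v in gamma.items():
            # Step 1 (pick hyperedge) times step 2 (pick node in it).
            probs[v] = probs.get(v, 0) + (w / d_u) * Fraction(g_v, delta_e)
    return probs


if __name__ == "__main__":
    for target, p in sorted(lazy_step_probabilities("c", HYPEREDGES).items()):
        print(target, p)
```

Because the walk is lazy, the walker can stay at its current node: in this toy example, node c keeps half of its outgoing probability on itself.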
The bipartite, unipartite, and multilayer network representations have different advantages and limitations (Fig. 1). A weighted, undirected network suffices for memoryless lazy random walks without hyperedge-dependent node weights, hyperedge-dependent node weights require directed networks, and hyperedge-similarity walks require multilayer networks.
Bipartite networks offer the most direct representation of the three-step random-walk process above. We represent the hyperedges with hyperedge nodes, and the three steps become a two-step walk between the nodes at the bottom and the hyperedge nodes at the top in Fig. 1b. For simplicity, we refer to them as nodes and hyperedge nodes. First comes a step from a node $u$ to a hyperedge node $e$, taken in proportion to the hyperedge weight $\omega(e)$ normalised by node $u$'s total incident hyperedge weight $d(u)$,
$$P_{ue} = \frac{\omega(e)}{d(u)}, \qquad (1)$$
and then a step from the hyperedge node to a node $v$, taken in proportion to the node weight $\gamma_e(v)$ normalised by the hyperedge's total node weight $\delta(e)$,
$$P_{ev} = \frac{\gamma_e(v)}{\delta(e)}. \qquad (2)$$
By starting the random walk on the nodes and taking two steps at a time, corresponding to a two-step Markov process 28, hyperedge nodes are only intermediate stops with zero flow when the random walk is back on the nodes after two steps. The stationary distribution of the random walk is concentrated on the nodes. For non-lazy walks represented with bipartite networks, we use so-called state nodes 27 in the hyperedge nodes. One state node for each incoming link, with out-links to all nodes in the hyperedge except the incoming link's source, ensures that the walks are not backtracking (Fig. 2).
To represent the random walk on a unipartite network, we project the three-step random-walk process down to a one-step process between the nodes and describe it with the transition rate matrix
$$P_{uv} = \sum_{e \in E(u,v)} \frac{\omega(e)}{d(u)} \frac{\gamma_e(v)}{\delta(e)}, \qquad (3)$$
where $E(u, v) = \{e \in E : u \in e, v \in e\}$ is the set of hyperedges incident to both nodes $u$ and $v$. Each hyperedge forms a fully connected group of nodes (Fig. 1c). Unipartite networks for non-lazy walks have no self-links. Compared with the bipartite representation, the unipartite representation with fully connected groups of nodes requires more links.
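The relation between the bipartite two-step walk and the unipartite projection can be sketched in a few lines of Python. The snippet composes the step from a node to a hyperedge node with the step from the hyperedge node to a node; for non-lazy walks it drops the link back to the source node, which mimics the state nodes described above. The toy hypergraph, the function name, and the renormalisation over the remaining nodes are assumptions made for illustration, and the sketch further assumes that every hyperedge contains at least two nodes.

```python
from fractions import Fraction

# Hypothetical toy hypergraph:
# hyperedge -> (hyperedge weight, {node: node weight within that hyperedge})
HYPEREDGES = {
    "e1": (Fraction(1), {"a": 1, "b": 1, "c": 2}),
    "e2": (Fraction(3), {"c": 1, "d": 1}),
}


def projected_transitions(hyperedges, lazy=True):
    """Compose node -> hyperedge node -> node steps into node -> node rates."""
    d = {}  # total incident hyperedge weight per node
    for weight, gamma in hyperedges.values():
        for u in gamma:
            d[u] = d.get(u, 0) + weight
    rates = {u: {} for u in d}
    for weight, gamma in hyperedges.values():
        for u in gamma:
            to_hyperedge = weight / d[u]  # step from node to hyperedge node
            # Non-lazy: the state node reached from u has no link back to u.
            targets = {v: g for v, g in gamma.items() if lazy or v != u}
            total = sum(targets.values())
            for v, g in targets.items():
                step = to_hyperedge * Fraction(g, total)
                rates[u][v] = rates[u].get(v, 0) + step
    return rates


if __name__ == "__main__":
    print(projected_transitions(HYPEREDGES, lazy=True)["c"])
    print(projected_transitions(HYPEREDGES, lazy=False)["c"])
```

With `lazy=False` the projected network has no self-links, in line with the statement that unipartite networks for non-lazy walks have no self-links.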
To represent the random walk on a multilayer network, we project the three-step random-walk process down to a one-step process on state nodes in separate layers, with one layer $\alpha$ for each hyperedge. A state node $u^\alpha$ represents node $u$ in each layer $\alpha \in E(u)$ that contains the node. All state nodes in the same layer form a fully connected set (Fig. 1d). Using the layer index also for the corresponding hyperedge, the transition rate between state node $u^\alpha$ in layer $\alpha$ and state node $v^\beta$ in layer $\beta$ is
$$P^{\alpha\beta}_{uv} = \frac{\omega(\beta)}{d(u)} \frac{\gamma_\beta(v)}{\delta(\beta)} \qquad (4)$$
for $\beta \in E(u, v)$.
Node $u$'s state-node visit rates in different layers sum to $u$'s visit rate in the unipartite and bipartite representations. With one state node per hyperedge layer that contains the node, the multilayer representation requires the most nodes and links to describe the walk. But this cost comes with benefits: the multilayer representation can describe higher-order Markov chains, which can capture more regularities in the data.
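The following Python sketch illustrates the multilayer representation on a toy example: it builds the transition rates between state nodes and checks that a candidate flow, proportional to the layer weight times the node weight, is stationary. The toy hypergraph uses hyperedge-independent node weights so that the stationary flow has a closed form; the data, the function names, and the dictionary layout are assumptions for illustration only.

```python
from fractions import Fraction

# Hypothetical toy hypergraph with hyperedge-independent node weights:
# hyperedge -> (hyperedge weight, {node: node weight})
HYPEREDGES = {
    "e1": (Fraction(1), {"a": 1, "b": 1, "c": 1}),
    "e2": (Fraction(3), {"c": 1, "d": 1}),
}


def multilayer_transitions(hyperedges):
    """Transition rates between state nodes (node, layer)."""
    d = {}  # total incident hyperedge weight per node
    for weight, gamma in hyperedges.values():
        for u in gamma:
            d[u] = d.get(u, 0) + weight
    rates = {}
    for alpha, (_, gamma_alpha) in hyperedges.items():
        for u in gamma_alpha:
            row = {}
            for beta, (weight, gamma_beta) in hyperedges.items():
                if u in gamma_beta:  # the walker can only pick u's hyperedges
                    delta = sum(gamma_beta.values())
                    for v, g in gamma_beta.items():
                        row[(v, beta)] = (weight / d[u]) * Fraction(g, delta)
            rates[(u, alpha)] = row
    return rates


def step(flow, rates):
    """Move the flow one step along the state-node transition rates."""
    nxt = {state: Fraction(0) for state in rates}
    for state, row in rates.items():
        for target, p in row.items():
            nxt[target] += flow[state] * p
    return nxt


RATES = multilayer_transitions(HYPEREDGES)
# Candidate stationary flow: layer weight times node weight, normalised.
NORM = sum(w * sum(g.values()) for w, g in HYPEREDGES.values())
FLOW = {(u, e): w * g[u] / NORM for e, (w, g) in HYPEREDGES.items() for u in g}

if __name__ == "__main__":
    print(step(FLOW, RATES) == FLOW)
    print(FLOW[("c", "e1")] + FLOW[("c", "e2")])
```

Summing the flow over node c's two state nodes gives the visit rate that the unipartite projection assigns to c in this toy example.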
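For example, a useful variant of the basic hypergraph random walk is to pick a hyperedge not only proportional to its weight but also proportional to its similarity to the hyperedge picked in the previous step. To include hyperedge-dependent node weight information in the similarity measure, we use one minus the Jensen-Shannon divergence (JSD) between the transition rate vectors $P_\alpha$ and $P_\beta$ to nodes at layers $\alpha$ and $\beta$ as the hyperedge coupling strength,
$$D^{\alpha\beta} = 1 - \mathrm{JSD}(P_\alpha, P_\beta), \qquad (5)$$
for $\beta \in E(u, v)$. With node $u$'s total incident hyperedge weight in layer $\alpha$,
$$d^\alpha(u) = \sum_{\beta \in E(u)} D^{\alpha\beta} \omega(\beta), \qquad (6)$$
the hyperedge-similarity walk has the transition rates
$$P^{\alpha\beta}_{uv} = \frac{D^{\alpha\beta} \omega(\beta)}{d^\alpha(u)} \frac{\gamma_\beta(v)}{\delta(\beta)} \qquad (7)$$
for $\beta \in E(u, v)$.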
Because the transition rates at a node depend on the current layer, the random walks generate non-Markovian dynamics that a unipartite or bipartite network representation cannot capture.
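To make the dependence on the current layer concrete, the sketch below computes a coupling strength of one minus the Jensen-Shannon divergence between the node distributions of two hyperedges and uses it to bias the choice of the next hyperedge. It assumes that each hyperedge's transition rate vector is its normalised node weights and that the choice is proportional to coupling times hyperedge weight; the toy data and function names are hypothetical.

```python
from math import log2


def jsd(p, q):
    """Jensen-Shannon divergence in bits between two distributions (dicts)."""
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in set(p) | set(q)}

    def kl(a):
        # Kullback-Leibler divergence from a to the mixture distribution.
        return sum(x * log2(x / mix[k]) for k, x in a.items() if x > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def node_distribution(gamma):
    """Normalise a hyperedge's node weights into a probability vector."""
    total = sum(gamma.values())
    return {v: g / total for v, g in gamma.items()}


def layer_choice(current, node, hyperedges):
    """Probability of picking each hyperedge of `node`, given the current layer."""
    p_current = node_distribution(hyperedges[current][1])
    scores = {}
    for beta, (weight, gamma) in hyperedges.items():
        if node in gamma:
            coupling = 1.0 - jsd(p_current, node_distribution(gamma))
            scores[beta] = coupling * weight
    total = sum(scores.values())
    return {beta: s / total for beta, s in scores.items()}


# Hypothetical toy hypergraph: hyperedge -> (weight, {node: node weight})
HYPEREDGES = {
    "e1": (1.0, {"a": 1, "b": 1, "c": 1}),
    "e2": (1.0, {"c": 1, "d": 1}),
}

if __name__ == "__main__":
    print(layer_choice("e1", "c", HYPEREDGES))
```

With equal hyperedge weights, a regular walk at node c would pick either hyperedge with probability one half, whereas the similarity-biased choice favours staying in the current layer.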
To ensure ergodic node-visit rates, we derived an unrecorded teleportation scheme that leaves the node-visit rates unchanged when teleportation is superfluous for hypergraphs with hyperedge-independent node weights, is robust to changes in the teleportation rate when teleportation is needed 26, and is independent of the representation (see Methods).
Mapping flows on hypergraphs. To identify flow-based communities or modules in hypergraphs, we seek to compress a modular description of random walks on the network representations guided by their links. We cast the problem of finding flow-based communities in hypergraphs as a minimum-description-length problem with the map equation framework 3. With this compression-based framework, we can compare how much the different representations compress modular flows.
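As a rough illustration of the compression idea, the sketch below evaluates the standard two-level map equation for a given partition of a small undirected weighted network, using the stationary visit rates (node strength over total strength) and counting the flow that crosses module boundaries. It is a simplified stand-in and not Infomap's implementation: it assumes an undirected network without self-links, handles only two levels, and does not search for the best partition. The example network and all names are hypothetical.

```python
from math import log2


def entropy(values, total):
    """Entropy in bits of `values` normalised by `total`, skipping zeros."""
    return -sum((x / total) * log2(x / total) for x in values if x > 0)


def two_level_code_length(edges, module_of):
    """Two-level map equation for an undirected weighted network.

    edges: list of (u, v, weight); module_of: {node: module label}.
    """
    total_weight = 2 * sum(w for _, _, w in edges)
    visit = {}      # stationary node-visit rates
    exit_flow = {}  # flow leaving each module per step
    for u, v, w in edges:
        flow = w / total_weight  # link flow in each direction
        visit[u] = visit.get(u, 0.0) + flow
        visit[v] = visit.get(v, 0.0) + flow
        if module_of[u] != module_of[v]:
            for node in (u, v):
                m = module_of[node]
                exit_flow[m] = exit_flow.get(m, 0.0) + flow
    # Index codebook: used whenever the walker switches module.
    q = sum(exit_flow.values())
    length = q * entropy(exit_flow.values(), q) if q > 0 else 0.0
    # Module codebooks: node visits plus the exit code of each module.
    for m in set(module_of.values()):
        flows = [exit_flow.get(m, 0.0)]
        flows += [p for n, p in visit.items() if module_of[n] == m]
        usage = sum(flows)
        length += usage * entropy(flows, usage)
    return length


# Hypothetical example: two triangles joined by a single link.
EDGES = [(1, 2, 1.0), (2, 3, 1.0), (1, 3, 1.0),
         (4, 5, 1.0), (5, 6, 1.0), (4, 6, 1.0), (3, 4, 1.0)]
ONE_MODULE = {n: 0 for n in range(1, 7)}
TWO_MODULES = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}

if __name__ == "__main__":
    print(two_level_code_length(EDGES, ONE_MODULE))
    print(two_level_code_length(EDGES, TWO_MODULES))
```

For this example, the two-module partition gives a shorter description than putting all nodes in one module, which is the sense in which a good partition compresses the flow.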
When used to detect communities, the representation matters because bipartite, unipartite, and multilayer networks provide the community-detection algorithm Infomap with different degrees of freedom 27. Infomap assigns only nodes to communities in a unipartite network but also assigns hyperedge nodes in a bipartite network. The multilayer network, with a state node for each hyperedge a node belongs to, implies even more node assignments and possibly overlapping communities.
Table I. Optimal flow-based communities of the schematic hypergraph in Fig. 1 represented with different networks. The number of nodes includes state nodes for the multilayer representations and the bipartite non-lazy representation. We measure the overlap as the perplexity of the optimal solutions (see Methods).

When mapping flows modelled by lazy and non-lazy random walks on the schematic network in Fig. 1, the optimal partitions of the bipartite networks have two communities, whereas the unipartite and multilayer networks have three communities (Table I and Fig. 3). The bipartite network favours fewer modules (using the optimal three-module partition of the unipartite network on the bipartite network gives code length 3.29 bits instead of 2.90 bits for two modules) because the random walker transitions more frequently between modules when they include hyperedges: even if a hyperedge node contains no flows at the end of each two-step walk from node through hyperedge node to node, assigning it to a module costs extra bits when it has nodes in multiple modules. For example, if three nodes in the bipartite network in Fig. 1(b) belonged to a third green module as in the optimal unipartite solution, and a random walker at one of them returned to the hyperedge it came from before revisiting that node, it would first need to exit the green module and enter the orange module, then exit the orange module and re-enter the green module. The corresponding walk on the unipartite network stays within the green module. As a result, the unipartite network representation favours more, smaller modules than the bipartite network representation for lazy and non-lazy walks (Table I).
Multilayer networks enable further compression with overlapping modules. But for this small network, only non-lazy walks give overlapping modules with 0.01 bits compression gain (Table I). With walks that preferentially move to similar hyperedges, the optimal partitions of the multilayer hyperedge-similarity network representations for lazy and non-lazy random walks both have more overlap in four modules (Table I and Fig. 3). The hyperedge-similarity walks favour these overlapping modules because they stay longer within them than the regular walks.
For a given random-walk model, the representations give equivalent node-visit rates but alter the link flows, and with different link flows, the optimal partition can change. The bipartite network representation favours partitions with fewer modules than the unipartite network representation because assigning hyperedge nodes to modules implies encoding more transitions between modules.
Experiments. To illustrate how the network representation affects detected communities in real hypergraphs, we generated a collaboration hypergraph from the 734 references in Networks beyond pairwise interactions: Structure and dynamics by F. Battiston et al. 8. We modelled the referenced articles as hyperedges and their authors as nodes. Authors with multiple articles form connections between the hyperedges. We analysed the largest connected component with $|V| = 361$ author nodes in $|E| = 220$ hyperedges.

The median number of authors in a hyperedge is 3, and the authors have contributed to 2.2 articles on average, though most have contributed to only one. We assigned the relative importance of references by their number of citations $c$ in December 2020. Some references had no citations and some were highly cited. One such example is Diffusion of innovations by Everett M. Rogers, with more than 120,000 citations. To avoid disproportionally large or small hyperedge weights $\omega(e)$, we weighted the hyperedges by the logarithm of the number of citations and added unit constants to avoid the zero-citation problem,
$$\omega(e) = \ln(c + 1) + 1. \qquad (8)$$
We modelled the authors' different contributions to articles by assigning higher weights to the first and last author 20. We used the edge-dependent node weights
$$\gamma_e(u) = \begin{cases} 2 & \text{if node } u \text{ is first or last author,} \\ 1 & \text{otherwise.} \end{cases} \qquad (9)$$
We assumed equal contribution for alphabetically sorted authors and assigned all of them weight $\gamma_e(u) = 1$. This model ranks co-corresponding authors' contributions lower than those of the corresponding authors.
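The weighting scheme above is simple enough to state as code. The following Python sketch maps a citation count to a hyperedge weight and an author list to hyperedge-dependent node weights. The function names and the `alphabetical` flag, which the caller must set for papers with alphabetically sorted author lists, are assumptions for illustration; how such papers were identified is not specified here.

```python
from math import log


def hyperedge_weight(citations):
    """Log-damped citation weight; an uncited paper gets weight 1."""
    return log(citations + 1) + 1


def author_weights(authors, alphabetical=False):
    """Weight 2 for the first and last author and 1 otherwise.

    All authors get weight 1 when the list is flagged as alphabetical.
    """
    if alphabetical:
        return {author: 1 for author in authors}
    last = len(authors) - 1
    return {a: 2 if i in (0, last) else 1 for i, a in enumerate(authors)}


if __name__ == "__main__":
    print(hyperedge_weight(0), hyperedge_weight(120000))
    print(author_weights(["Xu", "Ahn", "Lee"]))
```

The logarithm compresses the range of weights: a paper with 120,000 citations receives a weight of roughly 12.7 rather than one that is five orders of magnitude larger than that of an uncited paper.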
To study how hypergraph representations and random-walk models affect the community structure, we generated bipartite, unipartite, and multilayer representations for lazy and non-lazy random walks on the collaboration network. We identified nested hierarchical partitions in each network with Infomap, using 100 independent searches for each network. Infomap's running time depends on the number of nodes, links, and solution levels: the bipartite and unipartite representations finished 3-7 times faster than the multilayer representations. The non-lazy bipartite representation with many state nodes ran almost as long.
The optimised partitions for the lazy and non-lazy representations behave like the schematic example: the bipartite representations have the fewest leaf modules and highest codelengths, and the multilayer hyperedge-similarity representations have the most leaf modules and shortest codelengths, with the unipartite and the regular multilayer representations in between (Table II). Except for the non-lazy bipartite representation with its many state nodes, the lazy representations have more leaf modules and shorter codelengths than their corresponding non-lazy representations because the lazy random walk is more confined than the non-lazy random walk.
With more nodes than in the schematic example, the solutions have more depth. The bipartite solutions have three, and the unipartite and multilayer solutions have four hierarchical levels. The unipartite and multilayer solutions also have more top modules. With non-lazy dynamics, they split the largest top module, and in the lazy dynamics, they split the two largest top modules. But the second-largest top module reunites in the hyperedge-similarity representation, with stronger connections between similar hyperedges (Fig. 4 and Fig. 7 in Appendix A). The unipartite and multilayer solutions are also most similar at the leaf level (Fig. 8 in Appendix A).
In this larger example, the multilayer hyperedge-similarity representations give more overlap. The non-lazy representations result in higher average overlap because random walkers visiting a node must continue to other nodes, often in the same or a similar hyperedge layer. When random walkers from dissimilar hyperedges come together at a node, they tend to return to where they came from and favour overlapping modules. The non-lazy representations also result in higher maximum overlap, with the same authors topping all representations (Fig. 5).
In line with the information-theoretic duality between finding regularities in data and compressing those data, representations that enable deeper solutions with more modules have shorter codelengths (Table II). The lazy multilayer representation is an exception. Its optimised codelength is bounded above by the lazy unipartite representation's codelength, because they have the same codelength for the same hard partition, and overlapping modules can potentially reduce the codelength. Infomap's best codelength was instead 0.05 percent longer than for the lazy unipartite representation. Multilayer representations with their many state nodes and links aggravate the search problem, and Infomap could not find a better solution in 100 attempts. But the gain from overlapping modules is higher for the non-lazy multilayer representation, and Infomap finds a solution with a significantly shorter codelength.
A case study on fossil data. Palaeontologists classify major groups of marine animals archived in the fossil record into global-scale faunas that change over time 29. They have used different network representations to understand the macroevolutionary pattern of marine biodiversity 30,31. However, it is still unclear how such an organisation of marine animals into modules representing global faunas changes with random-walk model and network representation. To illustrate how the network representation of the underlying palaeontological data affects empirical estimates of this macroevolutionary pattern, we generated a hypergraph from genus-level fossil occurrences presented in ref. 30 and retrieved from the PaleoDB 32. We restricted our analysis to fossil occurrences from the Cambrian (541 MY) to the Cretaceous period (66 MY) and modelled 77 geological stages as hyperedges and 13,276 genera as nodes. Genera occurring in multiple geological stages form connections between hyperedges. We weighted the hyperedges by dividing the number of samples where a genus occurs in a given geological stage by the total number of samples recorded at the stage, a procedure modified from ref. 33. We generated bipartite, unipartite, and multilayer network representations for lazy and non-lazy random walks from the underlying palaeontological data and identified optimised partitions in the assembled networks using Infomap.
For lazy random walks, Infomap partitioned only the multilayer representations into multilevel communities: three modules at the first hierarchical level [Fig. 6(a)]. Similar to the schematic example and the collaboration hypergraph, the bipartite representation for the lazy random walks has the fewest leaf modules and the highest codelength. The multilayer hyperedge-similarity representation has the most leaf modules and the shortest codelength (Table III).
For non-lazy random walks, Infomap partitioned the bipartite representation into a multilevel solution with shorter codelength than the unipartite representation and the standard multilayer representation [Fig. 6(b)]. The multilayer hyperedge-similarity representation once more provides the most leaf modules and the highest overlap.
The multilayer network representations, including lazy and non-lazy random walks, reproduce modules reminiscent of the Cambrian, Paleozoic, and modern evolutionary faunas widely used in macroevolutionary research 29. Also, leaf modules in the multilayer representations capture subfaunas from specific geological periods as nested modules such as Silurian, Triassic, Jurassic, and Cretaceous. Infomap applied to the bipartite representation of the non-lazy random walks identified similar subfaunas but combined Cambrian and Paleozoic faunas into a single top module, obscuring the large-scale pattern. Overall, our results indicate some advantages of using multilayer over bipartite and unipartite representations of fossil occurrence data to quantify marine biodiversity's macroevolutionary patterns, with lazy and non-lazy random walks providing similar solutions.

Conclusions
We have derived unipartite, bipartite, and multilayer network representations of hypergraph flows with different advantages. We used the information-theoretic and flow-based community detection method Infomap to explore how different hypergraph random-walk models and network representations change the number, size, depth, and overlap of identified multilevel communities. By identifying flow-based communities in both a schematic and real hypergraphs (a small collaboration hypergraph of researchers working on networks beyond pairwise interactions and a large faunal hypergraph of sampled species across geological stages), we found that the bipartite network representation is the most compact and enables the fastest community detection. A multilayer network representation that reinforces flows within similar layers, one for each hyperedge, gave the deepest modular structures with the most module overlap. But the modular detection gain comes at a high computational cost: combining fully connected layers with other layers requires many more nodes and links than in the bipartite network representation. If the research question does not require hyperedge assignments or overlapping modules, the unipartite network representation provides a trade-off with intermediate compactness, speed, and the ability to reveal modular regularities. Among the random-walk models, lazy walks typically give more modules in deeper nested structures, and non-lazy walks provide higher modular overlap. Our methods and results help researchers model and map flows on hypergraphs to study the effects of multibody interactions in complex systems.

Methods
Unrecorded teleportation. With hyperedge-independent node weights, where $\gamma_e(u) = \gamma(u)$ for all hyperedges $e \in E(u)$, undirected weighted networks can represent the dynamics, and the stationary distribution of the random walk $\pi_u$ is proportional to the product of node $u$'s total incident hyperedge weight $d(u)$ and weight $\gamma(u)$. With normalised node-visit rates 20,
$$\pi_u = \frac{d(u)\gamma(u)}{\sum_{v \in V} d(v)\gamma(v)}. \qquad (10)$$
For the multilayer network representation, the node-visit rates split between layers based on node $u$'s incident hyperedge weight per layer state node,
$$\pi_{u^\alpha} = \frac{\omega(\alpha)}{d(u)} \pi_u. \qquad (11)$$
With hyperedge-dependent node weights $\gamma_e(u)$, only directed weighted networks can represent the dynamics. We use random teleportation to ensure ergodic walks when deriving the node-visit rates with the power-iteration method. Unrecorded teleportation to links minimises the distortion 26: in each iteration of the power-iteration method, we distribute a fraction $\tau = 0.15$ of each node's flow volume among all nodes proportional to their out-link weights. The remaining flow volume moves on the links proportional to their weights. In the last iteration, we move all flows on the links proportional to their weights and record all flows on links and nodes to obtain the ergodic node- and link-visit rates with unrecorded teleportation. This procedure gives the same visit rates as simulating a random walker that only records moves on links: with probability $1 - \tau$, the random walker moves to a node by following the links proportional to their weights and records the link and the target node. With probability $\tau$, the random walker teleports, without recording, to a link's source node chosen proportional to the link weight. The normalised number of recordings of each node and link gives the visit rates.
We want teleportation applied to undirected networks, where it is unnecessary, to leave the node- and link-visit rates unchanged. We achieve this smooth teleportation by scaling the transition rates from nodes by the node-visit rates, so that the link from node $u$ to node $v$ has weight $\pi_u P_{uv}$. Then unrecorded teleportation proportional to the nodes' total out-link weights followed by recorded moves on the links proportional to their weights distributes on the nodes according to the ergodic visit rates on undirected networks 26. For the general case when the node weights can depend on the hyperedge, and the network may be directed, we use Eq. 10 without assuming $\gamma_e(u) = \gamma(u)$ as an approximation of the node-visit rates:
$$\pi_u \approx \frac{\sum_{e \in E(u)} \omega(e)\gamma_e(u)}{\sum_{v \in V} \sum_{e \in E(v)} \omega(e)\gamma_e(v)}$$
for nodes and
$$\pi_{u^\alpha} \approx \frac{\omega(\alpha)\gamma_\alpha(u)}{\sum_{v \in V} \sum_{e \in E(v)} \omega(e)\gamma_e(v)}$$
for state nodes. With exact node-visit rates, we would obtain the stationary flow volumes on links by multiplying the transition rates by the source nodes' visit rates. With approximate node-visit rates, we instead obtain the link weights from the corresponding products for bipartite networks, unipartite networks, and multilayer networks. With unrecorded teleportation proportional to these link weights, modelling flows on hypergraphs gives node-visit rates that are robust to changes in the teleportation rate and independent of the representation.
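The power-iteration procedure described above can be sketched as follows. In each iteration, a fraction of the flow is redistributed to nodes in proportion to their total out-link weight, and the rest follows the links; a final step moves all flow along the links and records it. This is a minimal illustration rather than the implementation used for the results: it assumes every node has at least one out-link, uses a fixed number of iterations instead of a convergence test, and the function and parameter names are hypothetical.

```python
def unrecorded_teleportation(links, tau=0.15, iterations=200):
    """Node and link flows with unrecorded teleportation to links.

    links: {source: {target: weight}}; every node needs an out-link.
    """
    nodes = sorted(set(links) | {v for out in links.values() for v in out})
    out_weight = {u: sum(links.get(u, {}).values()) for u in nodes}
    total = sum(out_weight.values())
    # Teleportation lands on nodes proportional to their out-link weights.
    teleport = {u: out_weight[u] / total for u in nodes}
    p = dict(teleport)
    for _ in range(iterations):
        nxt = {u: tau * teleport[u] for u in nodes}  # unrecorded jumps
        for u, out in links.items():
            for v, w in out.items():
                nxt[v] += (1 - tau) * p[u] * w / out_weight[u]
        p = nxt
    # Final step: move all flow along links and record it.
    node_flow = {u: 0.0 for u in nodes}
    link_flow = {}
    for u, out in links.items():
        for v, w in out.items():
            flow = p[u] * w / out_weight[u]
            link_flow[(u, v)] = flow
            node_flow[v] += flow
    return node_flow, link_flow


# Hypothetical undirected path a - b - c, written as links in both directions.
PATH = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0}}

if __name__ == "__main__":
    print(unrecorded_teleportation(PATH)[0])
    print(unrecorded_teleportation(PATH, tau=0.5)[0])
```

On the undirected path, the recorded node flows equal the visit rates without teleportation (proportional to node strength) for any teleportation rate, which is the property the scheme is designed to preserve.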
Overlap metric. Modules overlap when Infomap assigns a node's state nodes in the multilayer network representations to different modules. Measuring the overlap through the absolute number of assignments is misleading because the overlap is 2 regardless of the number of state nodes assigned to a different module than the rest. Instead, we used the effective number of assignments. If a fraction $f$ of node $u$'s state nodes is assigned to the $i$th module in $u$'s module assignment set, the $i$th element of $u$'s assignment vector is $A_{u,i} = f$, and the effective number of assignments measured by the perplexity of $u$'s module assignments is
$$2^{H(A_u)} = 2^{-\sum_i A_{u,i} \log_2 A_{u,i}}.$$
The effective number of assignments is one if all of $u$'s state nodes are in one module, and it is equal to the number of assignments when the state nodes are divided evenly among $u$'s module assignments.
We averaged over all nodes for the partition overlap.
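A short Python sketch of the overlap measure, assuming each node is described by the list of module labels of its state nodes. The function names and the input layout are illustrative assumptions.

```python
from collections import Counter
from math import log2


def effective_assignments(state_node_modules):
    """Perplexity of a node's module assignments.

    state_node_modules: list with one module label per state node.
    """
    counts = Counter(state_node_modules)
    n = len(state_node_modules)
    entropy = -sum((c / n) * log2(c / n) for c in counts.values())
    return 2 ** entropy


def partition_overlap(assignments):
    """Average effective number of assignments over all nodes."""
    values = [effective_assignments(m) for m in assignments.values()]
    return sum(values) / len(values)


if __name__ == "__main__":
    print(effective_assignments(["A", "A", "A"]))
    print(effective_assignments(["A", "B"]))
    print(effective_assignments(["A", "A", "A", "B"]))
```

A node with three state nodes in one module and one in another counts as belonging to fewer than two modules, which is what distinguishes the effective number of assignments from the absolute number.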

Fig. 1. A schematic hypergraph represented with three types of networks. (a) The schematic hypergraph with weighted hyperedges and hyperedge-dependent node weights. Thin borders for weight 1 and thick borders for weight 3. A lazy random walk on the schematic hypergraph represented on (b) a bipartite network, (c) a unipartite network, and (d) a multilayer network. The colours indicate optimised module assignments, in (d) for hyperedge-similarity walks.

Fig. 2. Bipartite network with state nodes for non-lazy random walks. To prevent random walks on bipartite networks from visiting the same node at the bottom twice in a row by backtracking from the hyperedge node at the top, we use state nodes in the hyperedge nodes. Each hyperedge node requires one state node for each node in the hyperedge. Each state node has one incoming link from its source node and outgoing links to all other nodes in the hyperedge. Colours indicate the optimised partition in Fig. 3(b).

Fig. 3. Alluvial diagrams of optimal partitions for the schematic hypergraph in Fig. 1.(a) Optimal partitions for lazy walks represented with the networks in Fig. 1(b-d).(b) Optimal partitions for non-lazy walks.

Fig. 4. Alluvial diagrams of optimised partitions for different representations of the collaboration hypergraph. Lazy walks in (a) and non-lazy walks in (b). Module names from the top-ranked author within each module.

Fig. 5. Authors in the collaboration hypergraph with the highest average effective number of assignments in the lazy and non-lazy multilayer representations (see Methods).

Fig. 6. Alluvial diagrams of optimised partitions for the fossil hypergraph represented with different networks. Lazy walks in (a) and non-lazy walks in (b). We show top modules when a partition lacks deeper levels and leaf modules marked with dashed lines when they exist. Module names from the geological period or era represented by the fauna assemblage.

Fig. 7. Hierarchical maps of the collaboration hypergraph using (a) the bipartite representation and (b) the multilayer hyperedge-similarity representation. Module colours are the same as in Fig. 4(a). Aggregated inter-module links with sizes proportional to the exiting flow volume and length inversely proportional to the flow volume. White sub-modules are labelled with the top-ranked author. The largest blue top module in (a) contains ten sub-modules. In (b), the partition assigns those nodes to five top modules containing more sub-modules. S. Boccaletti, one of the most overlapping authors and highlighted in red, is assigned to one module in (a) and three top modules and six sub-modules in (b).

Table II. Optimised flow-based multilevel communities of the collaboration hypergraph represented with different networks. The number of nodes includes state nodes for the multilayer representations and the bipartite non-lazy representation. Shortest codelength of 100 trials with the variance in parentheses. We measure the overlap as the perplexity of the optimised solutions (see Methods).
a Hyperedge-similarity. Edge-dependent node weights: $\gamma_e(u) = 2$ if node $u$ is first or last author, 1 otherwise.

Table III. Optimised flow-based multilevel communities of the fossil hypergraph represented with different networks. The number of nodes includes state nodes for the multilayer representations and the bipartite non-lazy representation. The number of non-trivial top and leaf modules. Average number of levels weighted by the flow volume. We measure the overlap as the perplexity of the optimised solutions (see Methods). Shortest codelength of 20 trials with the variance in parentheses.