Learning low-rank latent mesoscale structures in networks

Researchers in many fields use networks to represent interactions between entities in complex systems. To study the large-scale behavior of complex systems, it is useful to examine mesoscale structures in networks as building blocks that influence such behavior. In this paper, we present an approach to describe low-rank mesoscale structures in networks. We find that many real-world networks possess a small set of latent motifs that effectively approximate most subgraphs at a fixed mesoscale. Such low-rank mesoscale structures allow one to reconstruct networks by approximating subgraphs of a network using combinations of latent motifs. Employing subgraph sampling and nonnegative matrix factorization enables the discovery of these latent motifs. The ability to encode and reconstruct networks using a small set of latent motifs has many applications in network analysis, including network comparison, network denoising, and edge inference.

It is often insightful to examine structures in networks [40] at an intermediate scale (i.e., at a 'mesoscale') that lies between the microscale of nodes and edges and the macroscale of distributions of local network properties. There is a large variety of mesoscale structures in networks, including community structure [9,47], core-periphery structure [49], and role structures [1]. We are interested in mesoscale network structures that are large enough that it is reasonable to discuss their collective properties but small enough that we can also discuss their statistical properties. In this paper, we examine mesoscale network structures that we obtain using k-node induced subgraphs of networks. These subgraphs have k nodes, which inherit their adjacency structure from the original networks from which we draw them. Because most real-world networks are sparse [40], independently choosing a set of k nodes from a network may not return meaningful information. Instead, we use motif sampling [29]: we first uniformly randomly sample a set of k nodes that form a path (we call this set a 'k-chain motif'), and we then obtain the subgraph that is induced by that k-chain motif by including all of the edges between those k nodes. This guarantees that we sample a connected subgraph of a network while assuming very little about the structure of the original network; by repeating this process, we obtain a data set of 'mesoscale patches' of a network. We then use 'dictionary-learning' algorithms [32] to learn mesoscale structures of networks that we call 'latent motifs'. Finally, we use latent motifs to infer subgraph structures of networks, compare different networks, and denoise corrupted networks.
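To make the sampling step concrete, the following minimal sketch (our illustration, not the paper's reference implementation) builds one mesoscale patch from a walk of k nodes; it assumes an undirected, connected networkx graph, and a simple random walk stands in for the MCMC motif-sampling algorithms that achieve the correct sampling distribution:

```python
# A minimal sketch (our illustration): build one k x k mesoscale patch
# from a walk of k nodes in an undirected, connected networkx graph G.
import random
import numpy as np
import networkx as nx

def sample_chain_patch(G, k):
    """Walk k nodes along edges of G (a stand-in for MCMC motif sampling)
    and return the k x k adjacency matrix induced by the walked nodes."""
    nodes = [random.choice(list(G.nodes))]
    for _ in range(k - 1):
        nodes.append(random.choice(list(G.neighbors(nodes[-1]))))
    A = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i != j and G.has_edge(nodes[i], nodes[j]):
                A[i, j] = 1.0
    return A
```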
Dictionary-learning algorithms are machine-learning techniques that learn interpretable latent structures of complex data sets; they are applied regularly in the data analysis of text and images [7,34,46]. Such algorithms usually consist of two steps. First, one samples a large number of structured subsets of a data set (e.g., square patches of an image or collections of a few sentences of a text); we refer to such a subset as a mesoscale patch of a data set. Second, one finds a set of basis elements such that a nonnegative linear combination of them can successfully approximate each of the sampled mesoscale patches. Such a set of basis elements is called a dictionary, and one can interpret each basis element as a latent structure of the data set. As an example, consider the image of the artwork Cycle by M. C. Escher in Figure 1a. We first sample 10,000 square patches of 21 × 21 pixels, and we then use a nonnegative matrix factorization (NMF) [19] algorithm to find a dictionary of r = 25 square patches (see Figure 1a). Each element of the learned dictionary describes a latent shape in the image.
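The following sketch reproduces this two-step pipeline with off-the-shelf tools; the random img array is only a placeholder for the Escher image, and scikit-learn's batch NMF stands in for the NMF algorithm that produced Figure 1a:

```python
# A sketch of the two-step image pipeline with scikit-learn; `img` is a
# placeholder array standing in for the Escher image.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.image import extract_patches_2d

rng = np.random.default_rng(0)
img = rng.random((200, 200))  # placeholder grayscale image
patches = extract_patches_2d(img, (21, 21), max_patches=10_000, random_state=0)
X = patches.reshape(len(patches), -1)      # one vectorized 21x21 patch per row
nmf = NMF(n_components=25, init="nndsvda", max_iter=400)
codes = nmf.fit_transform(X)               # nonnegative combination coefficients
dictionary = nmf.components_.reshape(25, 21, 21)  # r = 25 latent shapes
```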
Algorithms for network dictionary learning (NDL) [30] use a similar idea. As the mesoscale patches of a network, we use the k-node subgraphs that are induced by motif sampling, and we represent these subgraphs by their k × k adjacency matrices. After obtaining sufficiently many mesoscale patches of a network, we apply NMF to learn a dictionary for the latent adjacency matrices, which we call latent motifs of the network. We give a complete implementation of our approach in Algorithm 1 of the Supplementary Information (SI). See our SI for more details, including the theoretical guarantees for Algorithm 1 in Theorems G.2 and G.5.

Figure 1 caption: In panels (b) and (c), we show both heat maps and adjacency matrices. We take the columns of X to be the k × k adjacency matrices of the connected subgraphs that are induced by a walk of k = 21 nodes, where a walk of k nodes consists of k nodes x_1, ..., x_k such that x_i and x_{i+1} are adjacent for all i ∈ {1, ..., k − 1}. Using nonnegative matrix factorization (NMF), we compute an approximate factorization X ≈ WH into nonnegative matrices, where W has r = 25 columns. Because of this factorization, we can approximate any sampled mesoscale patch (i.e., any column of X) of an object by a nonnegative linear combination of the columns of W, which we interpret as latent shapes for images and as latent motifs (i.e., subgraphs) for networks. The network dictionaries of latent motifs that we learn from the (b) UCLA and (c) Caltech Facebook networks reveal distinctive social structures. For example, if we uniformly sample a chain of 21 friends in one of these networks, we observe for Caltech that there are likely to be communities with six or more nodes and also some 'hub' users who know most of the others in the sample. However, for UCLA, it is unlikely that there are such communities or hubs. In the heat map of the UCLA network, we show only the first 3000 nodes according to the node labeling in the data set.
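Under the same assumptions as the sketches above, the full NDL pipeline (sample patches, factorize, reshape) can be sketched as follows; scikit-learn's batch NMF again stands in for the online NMF of Algorithm 1 in the SI, and sample_chain_patch is the illustrative sampler from above:

```python
# A sketch of the NDL pipeline, reusing sample_chain_patch from above;
# batch NMF stands in for the online NMF of Algorithm 1 in the SI.
import numpy as np
from sklearn.decomposition import NMF

def learn_latent_motifs(G, k=21, r=25, n_samples=10_000):
    X = np.column_stack([sample_chain_patch(G, k).flatten(order="F")
                         for _ in range(n_samples)])   # k^2 x n_samples
    nmf = NMF(n_components=r, init="nndsvda", max_iter=400)
    W = nmf.fit_transform(X)           # k^2 x r, with X ~ W @ nmf.components_
    return [W[:, j].reshape(k, k, order="F") for j in range(r)]  # latent motifs
```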

Network Dictionary Learning
In Figure 1, we demonstrate the NDL method for networks of Facebook friendships (which were collected on one day in fall 2005) from UCLA ("UCLA") and Caltech ("Caltech") [55,56]. Each node in one of these networks is the Facebook account of an individual, and each edge encodes a Facebook friendship between two individuals. Using NDL, we learn 25 latent motifs from each of these two social networks using a chain motif with k = 21 nodes. The two diagonal lines in each latent motif in Figure 1 correspond to the edges along the k-chain motif. We refer to the remaining entries as 'off-chain' entries; one learns these from the subgraphs that are induced by the chain motif. The latent motifs in UCLA's dictionary (see Figure 1b) have sparse off-chain connections with a few clusters, whereas Caltech's dictionary (see Figure 1c) has relatively dense off-chain connections. Such latent motifs reveal distinct social structures in the two networks. For example, if we uniformly sample a chain of 21 friends in one of these networks, we observe for Caltech that there are likely to be communities with six or more nodes and also some 'hub' users who know most of the others in the sample. However, for UCLA, it is unlikely that there are such communities or hubs.

Figure 2 caption: Latent motifs that we learn from 14 networks (eight real-world networks and six synthetic networks, which include two distinct instantiations from each of three random-graph models) at five different scales (specifically, for k = 6, 11, 21, 51, 101) reveal distinct mesoscale structures in the networks. Using network dictionary learning (NDL), we learn network dictionaries of r = 25 latent motifs of k nodes for each of the 14 networks. For each network at each scale, we show only the second-most dominant latent motif from each dictionary. These motifs include more information than the most dominant motifs for these sparse networks. Black squares represent 1 entries and white squares represent 0 entries. See Section D.2 in our SI for details of how we measure latent-motif dominance, and see Figure 6 in our SI for the most dominant latent motifs of each network.
We examine mesoscale structures of eight real-world networks and six synthetic networks using NDL. The real-world networks are Facebook networks from Caltech, UCLA, Harvard, and MIT [55,56], SNAP Facebook (which we also denote as SNAP FB as a shorthand) [11,24], arXiv ASTRO-PH (with a shorthand of arXiv) [11,23], Coronavirus PPI (with a shorthand of Coronavirus) [10,44,52], and Homo sapiens PPI (with a shorthand of H. sapiens) [11,44]. The first four networks are 2005 Facebook networks from four universities from the Facebook100 data set [56]. The fifth network is a 2012 Facebook network that was collected from survey participants [24]. The sixth network is a collaboration network based on coauthorship of preprints that were posted in the astrophysics category of the arXiv preprint server. The seventh network is a protein-protein interaction (PPI) network of proteins that are related to the coronaviruses that cause Coronavirus disease 2019 (COVID-19), Severe Acute Respiratory Syndrome (SARS), and Middle East Respiratory Syndrome (MERS) [52]. The eighth network is a PPI network of proteins that are related to Homo sapiens [44].
For the six synthetic networks, we generate two instances each of Erdős-Rényi (ER) G(N, p) networks [8], Watts-Strogatz (WS) networks [57], and Barabási-Albert (BA) networks [2]. These are three types of well-studied random-graph models [40]. Each of the ER networks has 5000 nodes, and we independently connect each pair of nodes with probability p = 0.01 (in the network that we call ER_1) or p = 0.02 (in ER_2). For the WS networks, we use rewiring probabilities of p = 0.05 (in WS_1) and p = 0.1 (in WS_2), starting from a 5000-node ring network in which each node is adjacent to its 50 nearest neighbors. For the BA networks, we use m = 25 (in BA_1) and m = 50 (in BA_2), where m denotes the number of edges of each new node when it connects (via linear preferential attachment) to the existing network, which we grow from an initial network of m isolated nodes (i.e., none of them are adjacent to each other) until it has 5000 nodes. See Section F.1 in our SI for more details.
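A sketch of this generation step with networkx; the seeds are arbitrary, and networkx's barabasi_albert_graph uses its own initial condition rather than the m isolated nodes that we describe above:

```python
# A sketch of the synthetic-network generation with networkx; seeds are
# arbitrary, and the BA initial condition differs slightly from the text.
import networkx as nx

ER_1 = nx.erdos_renyi_graph(5000, p=0.01, seed=1)
ER_2 = nx.erdos_renyi_graph(5000, p=0.02, seed=2)
WS_1 = nx.watts_strogatz_graph(5000, k=50, p=0.05, seed=1)  # 50 nearest neighbors
WS_2 = nx.watts_strogatz_graph(5000, k=50, p=0.10, seed=2)
BA_1 = nx.barabasi_albert_graph(5000, m=25, seed=1)  # m edges per new node
BA_2 = nx.barabasi_albert_graph(5000, m=50, seed=2)
```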
One can interpret the size (i.e., the number of nodes) k of a chain motif as a scale parameter. A k × k mesoscale patch of a network that one obtains by using the k-chain motif encodes connectivity between nodes that are at most k − 1 edges apart in the network. In Figure 2, we show the second-most dominant latent motif (see Section D.2 and Figure 6 in our SI) that we learn from each of 14 networks (eight real-world networks and six synthetic networks) at various scales (specifically, for k = 6, 11, 21, 51, 101) when we use a dictionary with r = 25 latent motifs. The latent motifs differ drastically both across the different networks and across different scales.
Suppose that we are given a network G and a dictionary W of latent motifs at scale k, where we may or may not have learned W from G. Consider the following two scenarios. In one scenario, we suppose that we know G exactly, and we ask how to measure the 'effectiveness' of W in approximating mesoscale patches of G at scale k. In the other scenario, we suppose that G is a noisy version of some true network G_true and that W is 'faithful' in the sense that it can well-approximate mesoscale patches of G_true at scale k. (See (6) in the SI for the precise definition.) We then ask how we can infer the true network G_true from G and W.
To examine the above questions, we develop an algorithm that we call network denoising and reconstruction (NDR) (see Algorithm 2), which takes a network G and a network dictionary W as input and outputs a weighted network G_recons with the same node set as G. The NDR algorithm repeatedly (until convergence) samples mesoscale patches of G at scale k and finds nonnegative linear approximations of them using the latent motifs in W. Because each edge e of G can appear in multiple mesoscale patches of G, there may be multiple reconstructed weights for e in this procedure; we take their mean as the final weight of e in G_recons (see Algorithm 2). To measure the effectiveness of W for an unweighted network G (i.e., edges are either present or absent and there are no multi-edges), one can threshold the weighted edges of G_recons at some fixed level θ ∈ [0, 1] to obtain an undirected reconstructed network G_recons(θ) with binary edge weights (of either 0 or 1), which one can then compare directly with the original unweighted network G. We regard W as effective at describing G at mesoscale k if G_recons(θ) is close to G for some θ. (We quantify our notion of 'closeness' in the next paragraph.) We interpret the edge weights in G_recons as measures of confidence of the corresponding edges in G with respect to W. For example, if an edge e has the smallest weight in G_recons, then we regard it as the most 'outlying' with respect to the latent motifs in W. See Theorems G.7 and G.10 in the SI for theoretical guarantees and error bounds for NDR.
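As a concrete illustration of the thresholding step, the following minimal sketch assumes that G_recons is a networkx graph whose edges carry a 'weight' attribute in [0, 1]:

```python
# A sketch of thresholding: keep an edge of the weighted reconstruction
# only if its weight exceeds theta.
import networkx as nx

def threshold_network(G_recons, theta):
    G_theta = nx.Graph()
    G_theta.add_nodes_from(G_recons.nodes)
    G_theta.add_edges_from((u, v) for u, v, w in G_recons.edges(data="weight")
                           if w > theta)
    return G_theta
```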
In Figure 3, we show various reconstruction experiments using several real-world and synthetic networks. We perform these experiments for various values of the edge threshold θ and for r ∈ {9, 16, 25, 36, 49, 64, 81, 100} latent motifs in a single dictionary. Each network dictionary in Figure 3 uses a chain motif with k = 21 nodes, for which the dimension of the space of all possible mesoscale patches (i.e., the adjacency matrices of the induced subgraphs) is 21(21 − 1)/2 − 20 = 190. An important observation is that one can reconstruct a given network using an arbitrary network dictionary, which one can even learn from a different network. Such a 'cross-reconstruction' allows us to quantitatively compare the learned mesoscale structures of different networks. We label each subplot of Figure 3 with Y ← X to indicate that we reconstruct network Y using a network dictionary that we learn from network X. We turn the weighted reconstructed networks into unweighted networks by thresholding their edges at some threshold θ ∈ [0, 1]. We measure the reconstruction accuracy by calculating the Jaccard index between the original network's edge set and the reconstructed network's edge set. That is, to measure the similarity of two edge sets, we calculate the number of edges in the intersection of these sets divided by the number of edges in the union of these sets. (We obtain the same qualitative results as in Figure 3 if we instead measure similarity using the Rand index [48].)

Figure 3 caption: Self-reconstruction and cross-reconstruction accuracies between several real-world and synthetic networks versus the edge-threshold value θ and the number r of latent motifs in a network dictionary. The label X ← Y indicates that we reconstruct network X using a network dictionary that we learn from network Y. The reconstruction process produces a weighted network that we turn into an unweighted network by thresholding the edge weights at a threshold value θ, such that we keep only the edges whose weight is strictly larger than θ. We measure reconstruction accuracy by calculating the Jaccard index of the original network's edge set and the reconstructed network's edge set. In panel (a), we plot accuracies versus θ (keeping the number of latent motifs fixed at r = 25), where X is one of five real-world networks (two PPI networks, two Facebook networks, and one collaboration network). In panels (b) and (c), we reconstruct each of the four Facebook networks using network dictionaries with r ∈ {9, 16, 25, 36, 49, 64, 81, 100} latent motifs that we learn from one of ten networks (with the edge threshold fixed at θ = 0.5).
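The accuracy measure is simple to state in code; a minimal sketch for two networkx graphs on the same node set:

```python
# A sketch of the Jaccard-index accuracy between two edge sets.
def jaccard_accuracy(G_original, G_reconstructed):
    E1 = {frozenset(e) for e in G_original.edges}
    E2 = {frozenset(e) for e in G_reconstructed.edges}
    return len(E1 & E2) / len(E1 | E2)
```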
In Figure 3a, we plot the accuracy of 'self-reconstruction' X ← X versus θ (with r = 25), where X is one of the real-world networks Coronavirus, H. sapiens, SNAP FB, Caltech, and arXiv. The accuracies for H. sapiens and Caltech peak above 95% when θ ≈ 0.4, the accuracies for arXiv and SNAP FB peak above 82% for θ ≈ 0.6, and the accuracy for Coronavirus peaks above 95% near θ = 0.5. We choose θ = 0.5 for the cross-reconstruction experiments for the Facebook networks Caltech, Harvard, MIT, and UCLA in Figures 3b,c. We observe that these four Facebook networks have self-reconstruction accuracies above 75% using r = 25 latent motifs with θ = 0.5. The total number of dimensions when using mesoscale patches at scale k = 21 is 190, so this result suggests that all of these real-world networks have low-rank mesoscale structures at scale k = 21.
We consider accuracies for cross-reconstruction Y ← X in Figures 3b,c, where Y is one of the Facebook networks Caltech, Harvard, MIT, and UCLA and X (with X ≠ Y) is one of these four networks or one of the six synthetic networks ER_i, WS_i, or BA_i (with i ∈ {1, 2}). From the cross-reconstruction accuracies and our interpretation of the latent motifs (see Section B.4 in the SI) in Figures 1 and 2 (see also Figures 8, 10, and 14 in the SI), we draw the following conclusions at scale k = 21. First, the mesoscale structure of Caltech is distinct from those of Harvard, UCLA, and MIT. This is consistent with prior studies of these networks (see, e.g., [15,56]). Second, Caltech's mesoscale structure at scale k = 21 has a higher dimension than those of the other three universities' Facebook networks. For instance, the self-reconstruction accuracies in Figures 3b,c using r = 9 latent motifs are about 64% for Caltech and 80% or above for the other three universities' Facebook networks. Third, Caltech has many more communities of size at least 10 than the other three universities' Facebook networks. Fourth, both BA networks capture the mesoscale structures of MIT, Harvard, and UCLA at scale k = 21 better than the synthetic networks that we generate from the ER and WS models. See Section F.5 in the SI for further discussion.
In Figure 4, we use our algorithms to perform two types of network denoising. We can think of these as distinct binary-classification problems in network analysis: network denoising with subtractive noise (which is often called edge 'prediction' [12,18,26,28,37]) and network denoising with additive noise [4]. In each scenario, we suppose that we are given an observed network G = (V, E) with node set V and edge set E and are asked to find an unknown network G′ = (V, E′) with the same node set V but a possibly different edge set E′. We interpret G as a corrupted version of a 'true network' G′ that we observe with some uncertainty. One can interpret the spurious edges and missing edges of G as 'false relations' and 'false non-relations', respectively. In the subtractive-noise setting, we assume that G is a partially observed version of G′ (i.e., E ⊊ E′), and we seek to classify all non-edges in G into 'positives' (i.e., edges in G′) and 'negatives' (i.e., non-edges in G′). In the additive-noise setting, we suppose that G is a corrupted version of G′ to which some unknown edges have been added (i.e., E ⊇ E′), and we seek to classify all edges in G into 'positives' (i.e., edges in G′) and 'negatives' (i.e., non-edges in G′).
To experiment with these problems, we use the following five real-world networks: Caltech, SNAP FB, arXiv, Coronavirus, and H. sapiens. Given a network G = (V, E), our experiments proceed as follows. In the subtractive-noise setting, we create two smaller networks by removing uniformly random subsets that consist of 10% (in one experiment) or 50% (in the other experiment) of the edges of our network. In the additive-noise setting, we create two corrupted networks by adding edges between node pairs that we choose independently with a fixed probability so that 10% or 50% of the edges in a corrupted network are new. We then apply NDL with r = 25 latent motifs at scale k = 21 to learn a network dictionary for each of these five networks, and we use each dictionary to reconstruct the network from which we learned it using NDR. The reconstruction algorithm outputs a weighted network G_recons, where the weight of each edge is our confidence that the edge belongs to the true network. To denoise subtractive (respectively, additive) noise, we classify each non-edge (respectively, edge) of a corrupted network as 'positive' if its weight in G_recons is strictly larger than some threshold θ and as 'negative' otherwise. By varying θ, we construct a receiver-operating-characteristic (ROC) curve that consists of points whose horizontal and vertical coordinates are false-positive rates and true-positive rates, respectively.
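A sketch of one corruption experiment and the ROC/AUC computation, assuming networkx and scikit-learn; labels and scores (commented at the bottom) would come from the uncorrupted network and the NDR confidence weights, respectively:

```python
# A sketch of one corruption experiment plus the ROC/AUC computation.
import random
import networkx as nx
from sklearn.metrics import roc_curve, auc

def corrupt(G, noise=0.5, subtractive=True, seed=0):
    rng = random.Random(seed)
    H = G.copy()
    n_noise = int(noise * G.number_of_edges())
    if subtractive:                       # remove a uniformly random edge subset
        H.remove_edges_from(rng.sample(list(G.edges), n_noise))
    else:                                 # add uniformly random false edges
        while H.number_of_edges() < G.number_of_edges() + n_noise:
            u, v = rng.sample(list(G.nodes), 2)
            if not H.has_edge(u, v):
                H.add_edge(u, v)
    return H

# labels: 1 if a candidate edge is in the uncorrupted network, else 0;
# scores: the corresponding NDR confidence weights.
# fpr, tpr, _ = roc_curve(labels, scores)
# print("AUC:", auc(fpr, tpr))
```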
In Figure 4, we show the ROC curves and the corresponding area-under-the-curve (AUC) scores for our network-denoising experiments with subtractive and additive noise. For example, when we add 50% false edges (with one extra edge as a tie-breaker, given the odd number of edges in the original network) to Coronavirus, such that 2,463 edges are true and 1,232 edges are false, we are able to detect over 90% of the false edges while misclassifying only 10% of the true edges. In Figure 4f, we compare the performance of our method to those of several popular supervised algorithms that are based on network embeddings. Specifically, we compare to node2vec [11], DeepWalk [45], and LINE [50] for the task of denoising 50% subtractive noise for SNAP FB, H. sapiens, and arXiv. Our method achieves state-of-the-art results in all cases.

Figure 4 caption: In panel (f), we compare our denoising results for −50% noise on SNAP FB, H. sapiens, and arXiv to those of other methods. In our experiments with subtractive noise, we corrupt a network by removing a uniformly random subset of 10% or 50% of its edges, and we seek to classify the removed edges and the non-edges as true edges and false edges, respectively. In our experiments with additive noise, we corrupt a network by uniformly randomly adding 10% or 50% of the number of its edges, and we seek to classify the edges in the resulting corrupted network as 'negative' (i.e., false edges) or 'positive' (i.e., true edges). To perform classification in a network, we first use NDL to learn latent motifs from a corrupted network and then reconstruct the network using NDR to assign a confidence value to each potential edge. We then use these confidence values to infer the membership of potential edges in the uncorrupted network. Importantly, we never use information from the original networks. For the Caltech Facebook network in panel (b), we also perform edge inference and denoising using a network dictionary that we learn from the MIT Facebook network. For each network, we indicate the ROC curves and corresponding AUC scores for network denoising with additive noise using the labels +10% and +50%, and we indicate the ROC curves and corresponding AUC scores for network denoising with subtractive noise using the labels −10% and −50%.

We make two important remarks about applying the NDL and NDR algorithms to network denoising. First, NDL and NDR are able to perform the desired classification tasks in an unsupervised manner, in the sense that we do not require fully known examples to train our algorithms. This is particularly useful when it is difficult to obtain a large number of fully known examples, as when measuring PPI networks for a new organism (e.g., SARS-CoV-2). Part of the reason that NDL and NDR are successful at these network-denoising tasks is the low-rank nature of the examined mesoscale structures of the social and PPI networks (see Figure 3a). Specifically, because NDL learns a small number of latent motifs that successfully give an approximate basis for all mesoscale patches, the latent motifs should not be affected significantly by noise. Second, unlike the existing algorithms that we mentioned above, we are able to perform denoising on a network using information that we learn from an entirely different network. Consequently, we are able to successfully perform not only self-reconstruction but also cross-reconstruction. For instance, in Figure 4, we show the results of denoising Caltech using a dictionary that we learn from the MIT network, which we expect to have a similar structure to Caltech based on the results of the experiments that we highlighted in Figure 3 and on prior research on these networks [14,43,56].
Our experiments in Figures 3 and 4 illustrate that various social, collaboration, and PPI networks have low-rank [36] mesoscale structures, in the sense that a few latent motifs (e.g., r = 25, but see Figure 3 for other choices of r) that we learn using NDL are able to reconstruct, infer, and denoise the edges of an entire network via the NDR algorithm. We hypothesize that such low-rank mesoscale structures are a general phenomenon for networks of interactions in complex systems beyond the social, collaboration, and PPI networks that we have examined. As we have illustrated in this paper, one can leverage mesoscale structures to perform important tasks like network denoising, so it is important in future studies to explore the level of generality of our insights.

Supplementary Information: LEARNING LOW-RANK LATENT MESOSCALE STRUCTURES IN NETWORKS
Appendix A. Overview

In this supplementary material, we present our algorithms for network dictionary learning (NDL) and network denoising and reconstruction (NDR), and we prove theoretical results about their convergence and error bounds. We give the full NDL algorithm (see Algorithm 1) in Section D, and we give the full NDR algorithm (see Algorithm 2) in Section E. We introduce the notion of 'latent motif dominance' in Section D.2 to measure the significance of each latent motif that we learn from a network. In Section G, we give a rigorous analysis of the NDL and NDR algorithms.
Appendix B. Problem Formulation for Network Dictionary Learning (NDL)

B.1. Definitions and notation. To facilitate our discussions, we use terminology and notation from [27, Ch. 3]. In the main text, we described a network as a graph G = (V, E) with node set V and edge set E without directed edges or multi-edges, but possibly with self-edges. One can characterize the edge set E of G as an adjacency matrix A_G : V² → {0, 1}, where A(x, y) = 1({x, y} ∈ E) for each x, y ∈ V. The function 1(S) denotes the indicator of the event S; it takes the value 1 if S occurs and 0 if it does not occur. In this supplementary information, for broader applicability, we formulate our NDL framework in the more general setting in which the edges in a network may have weights. Although one can extend the above definition of networks to include weighted edges by adjoining an additional entry to G = (V, E) for edge weights, it is more convenient to simply extend the range of adjacency matrices from {0, 1} to the interval [0, ∞).
With the above considerations in mind, we define a network as a pair G = (V, A_G) with node set V and a weight matrix (which is also often called a 'weighted adjacency matrix') A_G : V² → [0, ∞) that encodes the interaction strengths between the nodes. For simplicity, we often drop the subscript G in A_G and denote it by A. A given graph G = (V, E) determines a unique network G = (V, A_G), where A_G is the adjacency matrix of G. The set V(G) is the node set of the network G, which has size |V(G)|, where |S| denotes the number of elements in the set S. A pair (x, y) of nodes in G is a directed edge if A(x, y) > 0. We say that a network G = (V, A) is symmetric if its weight matrix is symmetric (i.e., A(x, y) = A(y, x) for all x, y ∈ V), and we say that it is binary if A(x, y) ∈ {0, 1} for all x, y ∈ V. Given a symmetric network G = (V, A), we call an unordered pair {x, y} of nodes in G an edge if A(x, y) = A(y, x) > 0. We say that a network G = (V, A) is connected if for any two distinct nodes x, y ∈ V, there exists a sequence of nodes x_0, x_1, ..., x_m for some m ≥ 1 such that A(x_i, x_{i+1}) > 0 for all i ∈ {0, ..., m − 1} and {x_0, x_m} = {x, y}. We call G bipartite if it admits a 'bipartition', which is a partition V = V_1 ∪ V_2 of the node set such that A(x, y) = 0 whenever x, y ∈ V_1 or x, y ∈ V_2.

Suppose that we are given m elements v_1, ..., v_m in some vector space. By their mean, we refer to their sample mean v̄ = m⁻¹ ∑_{i=1}^m v_i. By their weighted average, we refer to the expectation ∑_{i=1}^m v_i p_i, where (p_1, ..., p_m) is a probability distribution on the set of m elements.

B.2. Homomorphisms between networks and motif sampling. Being able to sample from a complex data set according to a known probability distribution (e.g., a uniform one) is a crucial ingredient in dictionary-learning problems. This is often the case for image-processing applications [7,33,35], as it is straightforward to sample a k × k patch uniformly at random from an image. However, the similar problem of uniformly randomly sampling a k-node connected subnetwork from a network is not straightforward [3,16,22,58]. For our purpose of developing dictionary learning for networks, we use motif sampling, which was introduced recently in [29]. In motif sampling, instead of directly sampling a connected subnetwork, one samples a random node map from a smaller network (i.e., a motif) into a target network that preserves adjacency relations, and one then uses the subnetwork that is induced on the nodes in the image of the node map. As we discuss below, such a node map between networks is a homomorphism.
Fix an integer k ≥ 1 and a weight matrix A_F : [k]² → [0, ∞), where we use the shorthand notation [k] = {1, ..., k}. We use the term motif for the corresponding network F = ([k], A_F). A motif is a network, and we use such motifs to sample from a given network. The type of motif in which we are particularly interested is a k-chain, for which A_F = 1({(1, 2), (2, 3), ..., (k − 1, k)}). A k-chain is a directed path with node set [k]. For a general k-node motif F = ([k], A_F) and a network G = (V, A), we define the probability distribution π_{F→G} on the set V^[k] of all node maps x : [k] → V by

    π_{F→G}(x) = Z⁻¹ ∏_{a,b ∈ [k]} A(x(a), x(b))^{A_F(a,b)} ,    (1)

where Z is a normalizing constant that is called the homomorphism density of F in G [27]. A node map x : [k] → V is a homomorphism x : F → G if A(x(a), x(b)) > 0 for all a, b ∈ [k] with A_F(a, b) > 0 (with the convention that α⁰ = 1 for all α ∈ ℝ). When both A and A_F are binary matrices, the probability distribution π_{F→G} is the uniform distribution on the set of all homomorphisms F → G. This is the case for all examples that we discuss in the main manuscript. Motif sampling refers to the problem of sampling a random homomorphism x : F → G according to the distribution in (1). In Section C, we discuss three Markov Chain Monte Carlo (MCMC) algorithms for motif sampling.
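For binary networks, the homomorphism condition is easy to check directly; a minimal sketch (our illustration, assuming a networkx graph and a numpy motif matrix):

```python
# A sketch of the homomorphism condition for binary networks: every
# (weighted) edge of the motif F must map to an edge of G.
import numpy as np

def is_homomorphism(x, A_F, G):
    """x: list of k nodes of a networkx graph G; A_F: k x k numpy array."""
    k = len(x)
    return all(G.has_edge(x[a], x[b])
               for a in range(k) for b in range(k) if A_F[a, b] > 0)
```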
We call A_x in (2) the mesoscale patch of G that is induced by the homomorphism x : F → G; it is specified uniquely by the homomorphism x : F → G, the weight matrix A, and the mask Φ. For a k × k matrix B and a homomorphism x : F → G, we say that the (a, b) entries of B are on-chain if A_F(a, b) > 0 and off-chain otherwise. The condition A_F(a, b) > 0 implies that A(x(a), x(b)) > 0 by the definition of the homomorphism x, so the on-chain entries of A_x are always positive (and are always 1 if G is unweighted). However, the off-chain entries of A_x are not necessarily positive, so they encode meaningful information about a network that is 'detected' by the homomorphism x : F → G. For example, see the (1, 4) entries in the 4 × 4 mesoscale patches in Figure 5b. Suppose that the homomorphism x uses k distinct nodes of G in its image Im(x) := {x(a) | a ∈ {1, ..., k}}. It then follows that the network G_x := (Im(x), A_x) that is induced by x is a k-node subnetwork of G. In this case, the no-folding mask (4) is the same as the identity mask (3). However, when Im(x) has fewer than k nodes, G_x is not necessarily a subnetwork of G, and distinct positive off-chain entries of A_x with the identity mask may represent the same edge in G. We introduced the no-folding mask in (4) so that the positive off-chain entries of A_x always carry information that is not included in the on-chain entries.
As an illustration, consider the case in which F is the 6-chain motif and G = (V, A) is an undirected and binary graph. Consider the following two 6 × 6 binary matrices:

    P = [0 1 * * * *        Q = [0 1 0 1 0 1
         1 0 1 * * *             1 0 1 0 1 0
         * 1 0 1 * *             0 1 0 1 0 1
         * * 1 0 1 *             1 0 1 0 1 0
         * * * 1 0 1             0 1 0 1 0 1
         * * * * 1 0] ,          1 0 1 0 1 0] ,

where each entry * of P is either 0 or 1. Suppose that Φ_{F,x} is the identity mask (3). It then follows that any homomorphism x : F → G from the 6-chain motif F induces a mesoscale patch A_x of the form P, where the 1 entries on the two diagonals adjacent to the main diagonal correspond to the on-chain entries. Suppose that x uses only two distinct nodes, p and q, in G. Specifically, let p = x(1) = x(3) = x(5) and q = x(2) = x(4) = x(6). In this case, the mesoscale patch A_x equals Q, which has four off-chain entries of 1 in its upper triangle. However, all of the 1 entries of A_x in this case correspond to the single edge between p and q in G, so the off-chain entries of A_x do not give any new information about the network G that is not already given by its on-chain entries. The indicator function in the definition of the no-folding mask (4) prevents this situation. Indeed, in this case, the mesoscale patch A_x with the no-folding mask is the matrix P with all * entries equal to 0.

B.4. Problem formulation for network dictionary learning (NDL). The goal of the NDL problem is to learn, for a fixed integer r ≥ 1, a set of r nonnegative k × k matrices L_1, ..., L_r with Frobenius norms of at most 1 such that

    A_x ≈ a_1(x)L_1 + ··· + a_r(x)L_r    (5)

for each homomorphism x : F → G, for some coefficients a_1(x), ..., a_r(x) ≥ 0. For each homomorphism x : F → G, this implies that one can approximate the mesoscale patch A_x of G that is induced by x as a suitable linear combination of the r matrices L_1, ..., L_r. We call the tuple [L_1, ..., L_r] a network dictionary for G, and we call each L_i a latent motif of G. We identify a network dictionary [L_1, ..., L_r] with the nonnegative matrix W ∈ R^{k²×r}_{≥0} whose j th column is the vectorization of the j th latent motif L_j for j ∈ {1, ..., r}. The choice of vectorization R^{k×k} → R^{k²} can be arbitrary, but we use a column-wise vectorization in Algorithm A4.
For the latent motifs to be interpretable, it is crucial that we require the entries of the latent motifs L_i and the coefficients a_i(x) to be nonnegative. The nonnegativity constraint on each L_i allows one to interpret each L_i as the weight matrix of a k-node network. Additionally, because the coefficients a_i(x) are also nonnegative, the approximate decomposition in (5) implies that a_i(x)L_i ⪅ A_x entrywise. Therefore, if a_i(x) > 0, any network structure (e.g., nodes with large degree, communities, and so on) in the latent motif L_i must exist in A_x. Consequently, one can consider the latent motifs as approximate k-node subnetworks of G that exhibit typical network structures of G at scale k. In the spirit of Lee and Seung [19], one can regard the latent motifs as 'parts' of a network G.
For a more precise formulation of (5), consider the following optimization problem:

    inf_{L_1, ..., L_r} E_{x ∼ π_{F→G}} [ min_{a_1, ..., a_r ≥ 0} ‖ A_x − ∑_{i=1}^r a_i L_i ‖_F² ] ,    (6)

where the infimum is over all nonnegative k × k matrices L_1, ..., L_r with Frobenius norms of at most 1, π_{F→G} is the probability distribution that we defined in (1), and ‖·‖_F denotes the matrix Frobenius norm. The choice of the probability distribution π_{F→G} for the homomorphisms x : F → G is natural because it becomes the uniform distribution on the set of all homomorphisms F → G when the adjacency matrices of G and F are both binary. The NDL problem is computationally difficult because the objective function in (6) is non-convex and it is not obvious how to sample a homomorphism F → G according to the distribution π_{F→G} that we defined in (1). In Section D, we state an algorithm for NDL that approximately solves (6).
B.5. Overview of our algorithms and their theoretical guarantees. We now give an overview of our algorithms and their theoretical guarantees.
Algorithm 1: Given a network G, the NDL algorithm (see Algorithm 1) computes a sequence (W_t)_{t≥0} of network dictionaries of latent motifs, which take the form of k² × r matrices.

Algorithm 2: Given a network G, a network dictionary W, and a threshold parameter θ > 0, the NDR algorithm (see Algorithm 2) computes a sequence of weighted networks G_recons and binary (i.e., unweighted) networks G_recons(θ).

Theorem G.2: Given a non-bipartite network G and a choice of the parameters in Algorithm 1, we prove that the sequence (W_t)_{t≥0} of network dictionaries converges almost surely to the set of stationary points of the associated objective function in (6).

Theorem G.5: Given a bipartite network G and a choice of the parameters in Algorithm 1, we prove a convergence result that is analogous to the one in Theorem G.2.

Theorem G.7: Given a non-bipartite target network G and a network dictionary W, we show that (i) the sequence of weighted reconstructed networks G_recons that we compute using the NDR algorithm (see Algorithm 2) converges almost surely to some limiting network and (ii) we obtain a closed-form expression for this limiting network. We also show that (iii) a suitable distance between an arbitrary network G′ and the limiting reconstructed network is bounded above by the mean L¹ distance between the k × k mesoscale patches of G′ and their nonnegative linear approximations by the latent motifs in W. Finally, (iv) if G′ = G in (iii), we show that the upper bound on the distance between G and G_recons is approximately optimized if one learns the network dictionary W using the NDL algorithm (Algorithm 1).

Theorem G.10: We prove a convergence result that is analogous to the one in Theorem G.7 for a bipartite target network G.

Appendix C. Markov Chain Monte Carlo (MCMC) Motif-Sampling Algorithms
We mentioned in Section B.4 that one of the main difficulties in solving the optimization problem (6) is to directly sample a homomorphism x : F → G from the distribution π_{F→G} (see (1)). To overcome this difficulty, we use the Markov Chain Monte Carlo (MCMC) algorithms that were introduced in [29]. Although the algorithms in [29] apply to networks with edge weights and/or node weights, we use only simplified forms of them, which we give in Algorithms MP and MG. Additionally, Algorithm MP with the option AcceptProb = Approximate is a novel algorithm of the present paper. Using these MCMC sampling algorithms, we generate a sequence (x_t)_{t≥0} of homomorphisms F → G such that the distribution of x_t converges to π_{F→G} under some mild conditions on G and F [30, Thm. 5.7].
In the pivot chain (see Algorithm MP with AcceptProb = Exact), for each update x_t → x_{t+1}, the pivot x_t(1) first performs a random-walk move on G (see (7)) to a new node x_{t+1}(1) ∈ V. It accepts this move with a suitable acceptance probability (see (8)) according to the Metropolis-Hastings algorithm (see, e.g., [25, Sec. 3.2]). After the move x_t(1) → x_{t+1}(1), we sample each x_{t+1}(i) ∈ V for i ∈ {2, 3, ..., k} successively from the appropriate conditional distribution (see (9)). This ensures that the desired distribution π_{F→G} in (1) is a stationary distribution of the resulting Markov chain. In the Glauber chain (see Algorithm MG), we pick one node i ∈ [k] of F uniformly at random, and we resample its location x_t(i) ∈ V at time t to obtain x_{t+1}(i) ∈ V from the correct conditional distribution in (10) (see Figure 5a). See [25, Sec. 3.3] for background on the Metropolis-Hastings algorithm and Glauber-chain MCMC sampling.
Let ∆ denote the maximum degree (i.e., the largest number of neighbors) of the nodes in the network G = (V, A). We also say that the network G itself has a maximum degree of ∆. The Glauber chain has an efficient local update (with a computational complexity of O(∆)), but it converges quickly to the stationary distribution π_{F→G} only for networks that are dense enough that two homomorphisms that differ at one node have a probability of at least 1/(2∆) of coinciding after a single Glauber-chain update. (See [29, Thm. 6.1] for a precise statement.) By contrast, the pivot chain (see Algorithm MP with AcceptProb = Exact) has more computationally expensive local updates, with a computational complexity of O(∆^{k−1}) (as discussed in [29, Remark 5.6]), but it converges as fast as a 'lazy' random walk on a network. (In such a random walk, each move has a chance of being rejected according to a Metropolis-Hastings algorithm; see [29, Thm. 6.2].) Experimentally, we find that the Glauber chain is slow, especially for sparse networks (e.g., for Coronavirus PPI, which has an edge density of 0.0010, and UCLA, which has an edge density of 0.0037), and that the pivot chain is too expensive to compute for chain motifs with k ≥ 21. As a compromise with a low computational complexity and fast convergence (as fast as the standard random walk), we employ an approximate pivot chain, which is Algorithm MP with the option AcceptProb = Approximate. Specifically, we compute the acceptance probability α in (8) only approximately and thereby reduce the computational cost to O(∆). The tradeoff, which we discuss in the next paragraph, is that the stationary distribution of the approximate pivot chain may differ slightly from our target distribution π_{F→G}.
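For a binary network and the k-chain motif, one Glauber update has a particularly simple form, because the conditional distribution in (10) is uniform over the nodes that are adjacent to the (one or two) pinned chain neighbors; a sketch, assuming a networkx graph:

```python
# A sketch of one Glauber-chain update for the k-chain motif on a
# binary network G (networkx graph); x is a list of k nodes of G.
import random

def glauber_update(x, G):
    k = len(x)
    i = random.randrange(k)                  # uniformly random node of F
    if i == 0:                               # pinned neighbor: x[1]
        candidates = list(G.neighbors(x[1]))
    elif i == k - 1:                         # pinned neighbor: x[k-2]
        candidates = list(G.neighbors(x[k - 2]))
    else:                                    # common neighbors of x[i-1], x[i+1]
        candidates = [v for v in G.neighbors(x[i - 1])
                      if G.has_edge(v, x[i + 1])]
    y = list(x)
    y[i] = random.choice(candidates)         # resample the location of node i
    return y
```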
According to Proposition G.1, the stationary distribution of the approximate pivot chain is

    π̃_{F→G}(x) ∝ ∏_{a,b ∈ [k]} A(x(a), x(b))^{A_F(a,b)} / ∑_{y : F→G, y(1)=x(1)} ∏_{a,b ∈ [k]} A(y(a), y(b))^{A_F(a,b)} .    (11)

In general, the distribution (11) differs from the desired target distribution π_{F→G}. Specifically, π_{F→G}(x) is proportional only to the numerator in (11), and the sum in the denominator in (11) is a weighted count of the homomorphisms y : F → G for which y(1) = x(1). Therefore, under π̃_{F→G}, we penalize the probability of each homomorphism x : F → G according to the number of k-step walks in G that start from x(1) ∈ V. (The exact acceptance probability in (8) neutralizes this penalty.) It follows that π̃_{F→G} is close to π_{F→G} when the k-step-walk counts that start from each node of G do not differ too much across nodes. For example, on degree-regular networks like lattices, such counts do not depend on the starting node, and it thus follows that π̃_{F→G} = π_{F→G}. Nevertheless, despite the potential discrepancy between π_{F→G} and π̃_{F→G}, the approximate pivot chain gives good results for the reconstruction and denoising experiments that we showed in Figures 3 and 4 of the main manuscript.
Appendix D. Algorithm for Network Dictionary Learning (NDL)

D.1. Algorithm overview and statement. The essential idea behind our algorithm for NDL (see Algorithm 1) is as follows. We first sample a large number M of homomorphisms x_t : F → G from π_{F→G} and compute their corresponding mesoscale patches A_{x_t} for t ∈ {1, ..., M}. These M mesoscale patches of G form the data set (a so-called 'batch') to which we apply a dictionary-learning algorithm. Specifically, we (column-wise) vectorize each of these k × k matrices (using Algorithm A4) and obtain a k² × M data matrix X, and we then apply nonnegative matrix factorization (NMF) [19] to obtain a k² × r nonnegative matrix W for some fixed integer r ≥ 1, which yields an approximate factorization X ≈ WH for some nonnegative matrix H. From this procedure, we approximate each column of X by the nonnegative linear combination of the r columns of W with coefficients that are given by the corresponding column of H. Therefore, if we let L_i be the k × k matrix that we obtain by reshaping the i th column of W (using Algorithm A5), then [L_1, ..., L_r] is an approximate solution of (6). We give the precise meaning of 'approximate solution' in Theorems G.2 and G.5.

The scheme in the paragraph above requires one to store all M mesoscale patches, entailing a memory requirement that is at least of order k²M. Because M should scale with the size of G, this implies that we need unbounded memory to handle arbitrarily large networks. To address this issue, Algorithm 1 implements the above scheme in the setting of 'online learning', in which subsets (so-called 'minibatches') of data arrive sequentially and one does not store previous subsets of the data before processing new subsets. Specifically, at each iteration t = 1, 2, ..., T, we process only a sample matrix X_t that is smaller than the full matrix X and includes only N ≪ M mesoscale patches, where one can take N to be independent of the network size. Instead of a standard NMF algorithm for a fixed matrix [20], we use an 'online' NMF algorithm [30,32] that applies to sequences of matrices; the intermediate dictionary matrices W_t that we obtain by factorizing the sample matrices X_t typically improve as we iterate (see [30,32]). In Algorithm 1, we give a full implementation of the NDL algorithm.
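Before stating the update precisely, here is a sketch of a single iteration of the online scheme, assuming numpy and SciPy; nonnegative least squares stands in for Algorithm A1, and a few projected-gradient steps stand in for Algorithm A2. The update itself is given in (12) below.

```python
# A sketch of one iteration of the online NMF update (12); nonnegative
# least squares replaces the L1-regularized coding step (lambda = 0).
import numpy as np
from scipy.optimize import nnls

def online_nmf_step(X_t, W, P, Q, t, n_dict_steps=10, step=0.01):
    # coding: H_t = argmin_{H >= 0} ||X_t - W H||_F^2, column by column
    H = np.column_stack([nnls(W, X_t[:, j])[0] for j in range(X_t.shape[1])])
    # aggregate updates: weighted averages of old aggregates and new info
    P = (1 - 1 / t) * P + (1 / t) * (H @ H.T)      # r x r
    Q = (1 - 1 / t) * Q + (1 / t) * (H @ X_t.T)    # r x k^2
    # dictionary update: minimize tr(W P W^T) - 2 tr(Q W) over W >= 0
    for _ in range(n_dict_steps):
        W = np.maximum(W - step * (W @ P - Q.T), 0.0)  # projected gradient step
        norms = np.maximum(np.linalg.norm(W, axis=0), 1.0)
        W = W / norms                                   # column norms <= 1
    return W, P, Q, H
```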
We now explain how the NDL algorithm works. It combines one of three MCMC algorithms for motif sampling that we presented in Section C (a pivot chain, in which we use Algorithm MP with AcceptProb = Exact; an approximate pivot chain, in which we use Algorithm MP with AcceptProb = Approximate; or a Glauber chain, in which we use Algorithm MG) with the online NMF algorithm of [30]. Suppose that we have an undirected and binary graph G = (V, A) and a k-chain motif F = ([k], A_F). The requirement in Algorithm 1 that there exists at least one homomorphism F → G is satisfied as long as G has at least one edge, so we can find an initial homomorphism x_0 : F → G by rejection sampling (see Algorithm A3). At each iteration t = 1, 2, ..., T, the motif-sampling algorithm generates a sequence of N homomorphisms x_s : F → G for s = N(t − 1) + 1, ..., Nt and the corresponding mesoscale patches A_{x_s} (see Figure 5a), which we summarize as the k² × N data matrix X_t whose j th column is vec(A_{x_ℓ}) with ℓ = N(t − 1) + j, where vec(·) denotes the vectorization operator that we define in Algorithm A4. The online NMF algorithm then learns a nonnegative factor matrix W_t of size k² × r by improving the previous factor matrix W_{t−1} with respect to the new data matrix X_t via the update

    H_t ← argmin_{H ∈ R^{r×N}_{≥0}} ‖X_t − W_{t−1}H‖_F² + λ‖H‖_1 ,
    P_t ← (1 − t⁻¹)P_{t−1} + t⁻¹ H_t H_tᵀ ,                          (12)
    Q_t ← (1 − t⁻¹)Q_{t−1} + t⁻¹ H_t X_tᵀ ,
    W_t ← argmin_{W ∈ C} ( tr(W P_t Wᵀ) − 2 tr(Q_t W) ) ,

where C ⊆ R^{k²×r}_{≥0} is a compact and convex constraint set. It is an 'online' NMF algorithm because it factorizes a sequence (X_t)_{t∈{1,...,T}} of data matrices rather than a single matrix, as in conventional NMF algorithms [20]. During this entire process, the algorithm only needs to store auxiliary matrices P_t and Q_t of fixed sizes r × r and r × k², respectively; it does not need the previous data matrices X_1, ..., X_{t−1}. Therefore, NDL is efficient in memory and scales well with network size. Moreover, NDL is applicable to time-dependent networks because of its online nature, although we do not pursue this direction in the present paper.

In (12), we solve convex optimization problems to find the matrices H_t ∈ R^{r×N} and W_t ∈ R^{k²×r}. The first subproblem in (12) is a coding problem: given the two matrices X_t and W_{t−1}, we seek a factor matrix (i.e., a 'code matrix') H_t such that X_t ≈ W_{t−1}H_t. The parameter λ ≥ 0 is an L¹-regularizer, which encourages H_t to have a small L¹ norm. One can solve the coding problem efficiently by using Algorithm A1 or one of a variety of existing algorithms (e.g., LARS [6], LASSO [54], or feature-sign search [21]). The second and third lines of (12) update the 'aggregate matrices' P_{t−1} ∈ R^{r×r} and Q_{t−1} ∈ R^{r×k²} by taking a weighted average of them with the new information H_t H_tᵀ ∈ R^{r×r} and H_t X_tᵀ ∈ R^{r×k²}, respectively. We weight the old aggregate matrices by 1 − t⁻¹ and the new information by t⁻¹. By induction, we obtain P_t = t⁻¹ ∑_{s=1}^t H_s H_sᵀ and Q_t = t⁻¹ ∑_{s=1}^t H_s X_sᵀ. We use the updated aggregate matrices P_t and Q_t in the last subproblem of (12), which is called the dictionary-update problem and yields W_t. It is a constrained quadratic problem, and we can solve it using projected gradient descent (see Algorithm A2). In all of our experiments, we take the compact and convex constraint set C to be the set of W ∈ R^{k²×r}_{≥0} whose columns have a Frobenius norm of at most 1 (as required in (6)).

D.2. Dominance scores of latent motifs. In this subsection, we introduce a quantitative measurement of the 'prevalence' of the latent motifs in the network dictionary W_T that we compute using NDL (see Algorithm 1) for a network G.
Recall that the output of the NDL algorithm for a network G using a k-chain motif is a network dictionary W_T, which consists of r latent motifs L_1, ..., L_r of size k × k. Recall as well that the algorithm computes data matrices X_1, ..., X_T of size k² × N. Suppose that we have code matrices H_1, ..., H_T such that X_t ≈ W_T H_t for all t ∈ {1, ..., T}. More precisely, we let

    H_t := argmin_{H} ( ‖X_t − W_T H‖_F² + λ‖H‖_1 ) ,    (13)

where we take the argmin over all nonnegative r × N matrices H. The columns of H_t encode how to nonnegatively combine the latent motifs in W_T to approximate the mesoscale patches in X_t, so the rows of H_t encode the linear coefficients of each latent motif in W_T that we use for approximating the columns of X_t. Consequently, the mean inner products of the rows of H_t encode the mean usage of the latent motifs in W_T in G. This motivates us to consider the following mean Gramian matrix [13]:

    P̄_T := T⁻¹ ∑_{t=1}^T H_t H_tᵀ .

We can then take the square roots of the diagonal entries of P̄_T to obtain the mean prevalences of the latent motifs in W_T in G.
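A sketch of how one might compute the dominance scores, using placeholder stand-ins for the code matrices and for the aggregate matrix P_T of Algorithm 1 (which, as we discuss next, approximates P̄_T at no extra cost):

```python
# A sketch of the dominance scores; H and P_T are placeholders standing
# in for the code matrices and aggregate matrix of Algorithm 1.
import numpy as np

rng = np.random.default_rng(0)
H = rng.random((25, 1000))            # placeholder code matrix
P_T = (H @ H.T) / H.shape[1]          # placeholder aggregate (mean Gramian)
dominance = np.sqrt(np.diag(P_T))     # mean prevalence of each latent motif
ranking = np.argsort(dominance)[::-1] # most dominant latent motif first
```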
Computing P̄_T requires one to store the previous data matrices X_1, ..., X_T and to compute H_1, ..., H_T by solving (13) for each t ∈ {1, ..., T}; this is a very expensive computation. To address this issue, we instead use the aggregate matrix P_T that we compute as part of Algorithm 1, which therefore requires no extra computation. Note that P_T = T⁻¹ ∑_{t=1}^T Ĥ_t Ĥ_tᵀ, where Ĥ_t is the code matrix that is given by Ĥ_t = argmin_{H≥0} ( ‖X_t − W_{t−1}H‖_F² + λ‖H‖_1 ). The matrix P_T is an approximation of P̄_T because the defining equation of Ĥ_t is the same as that of H_t in (13) with W_T replaced by W_{t−1}. However, the approximation error between P_T and P̄_T vanishes as T → ∞ under mild conditions. Specifically, under the hypotheses of Theorems G.2 and G.5, W_t converges almost surely to some limiting dictionary. It follows that ‖P_T − P̄_T‖_F → 0 almost surely as T → ∞.

In Figure 2 of the main manuscript, we showed the second-most dominant latent motifs for 14 networks at the scales k = 6, 11, 21, 51, 101. In Figure 6, we show the most dominant latent motifs for the same sets of networks and scales. See Figures 8, 9, 10, and 14 in Section I for all 25 latent motifs (along with their dominance scores) of all of the networks in Figure 6 at all five scales. For many of these networks (e.g., Harvard, UCLA, arXiv, and ER_1) and at all five scales, the most dominant latent motifs in Figure 6 are close to the adjacency matrix of the k-chain itself. In these examples, the entries on the first superdiagonal and subdiagonal overwhelm the other entries. However, for all networks except ER_1, the second-most dominant latent motifs in Figure 2 reveal more interesting mesoscale structures (e.g., communities in Harvard, MIT, UCLA, and arXiv and a bipartition for edges that are not in the 11-chain in Coronavirus PPI) than the most dominant latent motifs in Figure 6.

D.3. Influence of masking and MCMC algorithms on latent motifs. Our NDL algorithm (see Algorithm 1) uses mesoscale patches A_x with either the identity mask (3) or the no-folding mask (4), but one can generalize it to any choice of mask. The original NDL algorithm [30, Algorithm 1] corresponds to Algorithm 1 with the option mask = Id (i.e., with the identity mask for mesoscale patches). In Section B, we discussed how the no-folding mask improves the interpretability of the positive entries in the mesoscale patches. Consequently, it also improves the interpretability of the latent motifs that we learn from them using Algorithm 1. In Figure 7, we compare r = 25 latent motifs that we learn from Coronavirus PPI using four different combinations of masks and MCMC motif-sampling algorithms. In Figures 7c,d, we see that the latent motifs that we learn with the identity mask have large off-chain entries that dominate the on-chain entries. (See Section B.3 for the definitions of on-chain and off-chain entries.) The most dominant latent motifs appear as complete bipartite networks with 21 nodes, in which each node in one set of the bipartition is adjacent to all nodes in the other set. This is counterintuitive, as Coronavirus PPI has a very low edge density of 0.0010. However, as we see in Figures 7a,b, the latent motifs that we learn with the no-folding mask (4) have comparatively sparser off-chain entries, which are dominated by the on-chain entries. By comparing Figure 7a with Figure 7b and Figure 7c with Figure 7d, we also see that the choice of MCMC algorithm (between the approximate pivot chain and the Glauber chain) in Algorithm 1 may not affect the latent motifs as much as whether we use the identity mask or the no-folding mask.
Appendix E. Algorithms for Network Denoising and Reconstruction (NDR)

E.1. Algorithm overview and statement. The standard pipeline for image denoising and reconstruction [7,33,35] is to uniformly randomly sample a large number of k × k overlapping patches of an image and then average their associated approximations at each pixel to obtain a reconstructed version of the original image. A reasonable network analog of this pipeline is as follows. Given a network G = (V, A), a motif F = ([k], A_F), and a network dictionary with latent motifs [L_1, ..., L_r], we uniformly randomly sample a large number T of homomorphisms x_t : F → G. To simplify the present discussion, we assume that each x_t uses k distinct nodes of G in its image V^(t) := {x_t(a) ∈ V | a ∈ {1, ..., k}}. (We do not make this assumption elsewhere in the paper.) This yields T k-node subnetworks G^(t) = (V^(t), A^(t)), where A^(t) is the mesoscale patch A_{x_t} in (2). We then approximate each weight matrix A^(t) by a nonnegative linear combination Â^(t) of the latent motifs L_i. We then define a network G_recons = (V, A_recons), where we set A_recons(p, q) for each p, q ∈ V to be the mean of the approximate weights Â^(t)(p, q) over all t ∈ {1, ..., T} such that p, q ∈ V^(t).
Our network denoising and reconstruction (NDR) algorithm (see Algorithm 2) builds on the idea in the preceding paragraph. Suppose that we have a network G = (V, A), a motif F = ([k], A_F), and a network dictionary W that consists of r nonnegative k × k matrices L_1, ..., L_r. First, because uniformly randomly sampling a homomorphism x_t : F → G is not as straightforward as uniformly randomly sampling k × k patches of an image, we generate a sequence (x_t)_{t∈{0,...,T}} of homomorphisms using an MCMC motif-sampling algorithm (see Algorithms MP and MG). In Algorithm 2, we initialize the matrices A_recons, A_count : V² → {0} with all entries equal to 0, and we sample an initial homomorphism x_0 : F → G by the rejection sampling in Algorithm A3. At each step t, we vectorize the mesoscale patch A_{x_t} (see (2)) into a k² × 1 matrix X_t (using Algorithm A4), compute a nonnegative code H_t and the mesoscale reconstruction X̂_t = W H_t, and reshape X̂_t into a k × k matrix Â_{x_t;W} (using Algorithm A5). We then update the global reconstruction: for each a, b ∈ {1, ..., k}, we take the running mean of the approximate weights Â_{x_t;W}(a, b) over all visits of the homomorphisms to the node pair (x_t(a), x_t(b)) up to time t.

As in Algorithm 1, we require in Algorithm 2 that there exists at least one homomorphism F → G. This condition is satisfied when F = ([k], A_F) is a chain motif and G has at least one edge; it holds for all of our experiments in the present paper. As in the first line of (12), the problem of finding H_t in Algorithm 2 is a standard convex problem, which one can solve by using Algorithm A1. There are two binary options in Algorithm 2: the choice of mask (mask ∈ {Id, NF}) and the choice of denoising mode (denoising ∈ {T, F}).

E.2. Further discussion of the denoising variant of the NDR algorithm. We now give a detailed discussion of Algorithm 2 with denoising = T for network-denoising applications. Recall that the network-denoising problem that we consider is to reconstruct a true network G_true = (V, A) from an observed network G_obs = (V, A′). The scheme that we used to produce Figure 4 is the following: (D.1) we use NDL to learn a network dictionary W from the corrupted network G_obs; (D.2) we use NDR with W to reconstruct G_obs and thereby obtain a weighted network G_recons = (V, A_recons); and (D.3) we classify each potential edge (p, q) as 'positive' if A_recons(p, q) > θ for a threshold θ and as 'negative' otherwise.
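A sketch of the NDR loop under the same assumptions as the NDL sketches in the main text; sample_chain_patch_with_nodes is a hypothetical variant of the earlier illustrative sampler that also returns the k walked nodes, and nonnegative least squares again stands in for Algorithm A1.

```python
# A sketch of the NDR reconstruction loop. sample_chain_patch_with_nodes
# is a hypothetical variant of the earlier sampler that also returns the
# k walked nodes alongside the k x k patch.
import numpy as np
from collections import defaultdict
from scipy.optimize import nnls

def ndr(G, motifs, k=21, T=10_000):
    W = np.column_stack([L.flatten(order="F") for L in motifs])  # k^2 x r
    weight_sum, count = defaultdict(float), defaultdict(int)
    for _ in range(T):
        nodes, A_x = sample_chain_patch_with_nodes(G, k)
        h, _ = nnls(W, A_x.flatten(order="F"))    # nonnegative code for the patch
        A_hat = (W @ h).reshape(k, k, order="F")  # approximated patch
        for a in range(k):
            for b in range(a + 1, k):
                if nodes[a] != nodes[b]:
                    e = frozenset((nodes[a], nodes[b]))
                    weight_sum[e] += A_hat[a, b]
                    count[e] += 1
    return {e: weight_sum[e] / count[e] for e in weight_sum}  # mean edge weights
```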
The NDR algorithm was first introduced in [30]. The version of the NDR algorithm from [30] is Algorithm 2 with the options mask = Id and denoising = F. It has competitive performance for denoising −50% subtractive noise on the networks SNAP FB, H. sapiens, and arXiv in comparison to the methods Spectral Clustering [41], DeepWalk [45], LINE [50], and node2vec [11]. In all of these methods, one first obtains a 128-dimensional vector representation of the nodes in a network; this is called a 'node embedding' of the network. One then uses this node embedding to compute vector representations of the edges using binary operations such as the Hadamard product. (See [11] for details.) One can then use a binary-classification algorithm (e.g., a support vector machine [42]) to attempt to detect the false edges. Spectral Clustering uses the top 128 eigenvectors of the normalized Laplacian matrix of G_obs to learn vector embeddings of the nodes. (See [51] for details.) The other three benchmark methods first generate sequences of nodes using random-walk sampling and then use a word-embedding technique (see, e.g., [39]) to learn a node embedding.
In Table 1, we compare the AUC scores of the NDR and NDL approach that we described at the beginning of Section E.2 for our network-denoising tasks to the AUC scores that one obtains using the four existing methods above. As was discussed in [30, Remark 4], a limitation of using NDR with mask = Id and denoising = F for network denoising is that one needs to invert what it means to classify successfully depending on whether the noise is additive or subtractive. Specifically, [30] used network denoising with NDR for additive noise, but for subtractive noise it is necessary to classify each non-edge (p, q) as 'positive' if A_recons(p, q) < θ. Therefore, the AUC scores for −50% noise in the fifth row of Table 1 are 'flipped', and the same is true of the sixth row of the table (for mask = NF and denoising = F). Our NDR algorithm with denoising = T addresses this directionality issue and allows us to use the unified classification scheme above for both additive and subtractive noise.
The idea behind NDR with denoising = T is to handle an issue that arises when denoising additive noise in sparse real-world networks but that does not arise in the image-denoising setting. Suppose that we obtain G_obs by adding some false edges to a sparse binary network G_true. The on-chain entries of the mesoscale patches A_x are always equal to 1. Therefore, the latent motifs that we learn from G_obs have constant on-chain entries (see, e.g., Figures 2 and 6).
Consequently, a linear approximation of the mesoscale patches A_x of G_obs by the latent motifs that we learn from G_obs cannot distinguish between true and false on-chain entries. Furthermore, because G_obs is sparse, there are many fewer positive off-chain entries in A_x than on-chain entries. Therefore, linear approximations of A_x using the latent motifs are likely to assign larger weights to reconstructing the on-chain entries of A_x than the off-chain entries. The resulting reconstruction of G_obs is thus similar to G_obs, and it is very hard to detect any false edges in G_obs. Using the option denoising = T prevents this issue by ignoring all on-chain entries both for each sampled mesoscale patch A_x and for each latent motif in the network dictionary W that we use for denoising. For example, using denoising = T instead of denoising = F for SNAP FB and arXiv (see Table 1) yields a performance gain of about 10% for +50% noise. Another issue arises when using NDR with denoising = F for denoising subtractive noise in sparse real-world networks. In many of our experiments in this situation, our network-denoising scheme in (D.1)–(D.3) seems to give 'flipped' ROC curves that lie below the diagonal line y = x that represents the baseline ROC curve, so the AUC score is close to 0 instead of 1. In other words, in the reconstructed network G_recons, false non-edges have larger weights than true non-edges. For the six instances in italics (for denoising = F and −50% noise) in Table 1, we give AUC scores after flipping the ROC curves; that is, we classify a non-edge (p, q) as 'positive' if A_recons(p, q) < θ, which is the opposite of the scheme (D.3) that we used in all cases of additive noise. It is unfortunate to have to use the opposite classification scheme for different types of noise, especially when one may not know the type of noise in advance. We suspect that the sparsity of real-world networks may lead to such 'opposite directionality'; this phenomenon requires further investigation. In contrast, the ROC curves for denoising = T are always above the diagonal in all experiments in Table 1 (as well as in Figure 4), regardless of whether the noise is additive or subtractive, and we always obtain the AUC scores in Table 1 using the scheme in (D.3).
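As a small illustration of the masking that denoising = T performs, the following sketch zeros out the on-chain entries of a k × k patch or latent motif; we assume that the k-chain motif has edges (j, j + 1) and that patches of undirected networks are stored as symmetric arrays.

```python
# A minimal sketch of on-chain masking for denoising = T: zero the entries of
# a k x k patch (or latent motif) that lie on the k-chain, so that only
# off-chain entries enter the linear approximation.
import numpy as np

def mask_on_chain(patch):
    k = patch.shape[0]
    masked = patch.copy()
    for j in range(k - 1):        # the k-chain motif has edges (j, j + 1)
        masked[j, j + 1] = 0.0
        masked[j + 1, j] = 0.0    # symmetric entry for undirected networks
    return masked
```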
Appendix F. Experimental details

F.1. Data sets. We now describe the eight real-world networks that we examined in the main manuscript: (1) Caltech: This connected network, which is part of the Facebook100 data set [56] (it previously was studied as part of the Facebook5 data set [55]), has 762 nodes and 16,651 edges. Nodes represent users in the Facebook network of Caltech on one day in fall 2005, and edges represent Facebook 'friendships' between these users. (2) MIT: This connected network, which is part of the Facebook100 data set [56], has 6,402 nodes and 251,230 edges. Nodes represent users in the Facebook network of MIT on one day in fall 2005, and edges represent Facebook 'friendships' between these users. (3) UCLA: This connected network, which is part of the Facebook100 data set [56], has 20,453 nodes and 747,604 edges. Nodes represent users in the Facebook network of UCLA on one day in fall 2005, and edges represent Facebook 'friendships' between these users. (4) Harvard: This connected network, which is part of the Facebook100 data set [56], has 15,086 nodes and 824,595 edges. Nodes represent users in the Facebook network of Harvard on one day in fall 2005, and edges represent Facebook 'friendships' between these users. (5) SNAP Facebook (SNAP FB) [24]: This connected network has 4,039 nodes and 88,234 edges. It is a Facebook network that has been used as a benchmark example for edge inference [11]. (6) arXiv ASTRO-PH (arXiv) [11,23]: This network has 18,722 nodes and 198,110 edges. Its largest connected component has 17,903 nodes and 197,031 edges. We use the full network in our experiments. It is a collaboration network between authors of papers in astrophysics that were posted to the arXiv preprint server. Nodes represent scientists and edges indicate coauthorship relationships. (7) Coronavirus PPI (Coronavirus): This connected network, which was curated by theBiogrid.org [10,44,52] from 142 publications and preprints, has 1,546 proteins that are related to coronaviruses and 2,481 protein-protein interactions (in the form of physical contacts) between them. We downloaded this data set on 24 July 2020. Among the 2,481 interactions, 1,546 are for SARS-CoV-2 and were reported by 44 publications and preprints; the rest are related to coronaviruses that cause Severe Acute Respiratory Syndrome (SARS) or Middle East Respiratory Syndrome (MERS). (8) Homo sapiens PPI (H. sapiens) [11,44,53]: This network has 24,407 nodes and 390,420 edges.
Its largest connected component has 24,379 nodes and 390,397 edges. We use the full network in our experiments. The nodes represent proteins in the organism Homo sapiens, and the edges represent physical interactions between these proteins. We now describe the six synthetic networks that we examined in the main manuscript: (9) ER_1 and ER_2: An Erdős-Rényi (ER) network [8,40], which we denote by ER(n, p), is a random-graph model. The parameter n is the number of nodes, and the parameter p is the independent, homogeneous probability that each pair of distinct nodes has an edge between them. The network ER_1 is an individual graph that we draw from ER(5000, 0.01), and ER_2 is an individual graph that we draw from ER(5000, 0.02). (10) WS_1 and WS_2: A Watts-Strogatz (WS) network, which we denote by WS(n, k, p), is a random-graph model for studying the small-world phenomenon [40,57]. In the version of WS networks that we use, we start with an n-node ring network in which each node is adjacent to its k nearest neighbors. With independent probability p, we then remove each edge and rewire it to a pair of distinct nodes that we choose uniformly at random. The network WS_1 is an individual graph that we draw from WS(5000, 50, 0.05), and WS_2 is an individual graph that we draw from WS(5000, 50, 0.10). (11) BA_1 and BA_2: A Barabási-Albert (BA) network, which we denote by BA(n, m), is a random-graph model with a linear preferential-attachment mechanism [2,40]. In the version of BA networks that we use, we start with m isolated nodes, and we introduce new nodes with m new edges each that attach preferentially (with a probability that is proportional to node degree) to existing nodes until we have a total of n nodes. The network BA_1 is an individual graph that we draw from BA(5000, 25), and BA_2 is an individual graph that we draw from BA(5000, 50).
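For concreteness, one can generate comparable synthetic networks with the standard networkx generators, as in the following sketch; note that networkx's WS and BA variants may differ in minor details (e.g., in initialization) from the versions that we described above.

```python
# Generating ER, WS, and BA networks with networkx.
import networkx as nx

ER1 = nx.erdos_renyi_graph(5000, 0.01, seed=0)         # ER(5000, 0.01)
WS1 = nx.watts_strogatz_graph(5000, 50, 0.05, seed=0)  # WS(5000, 50, 0.05)
BA1 = nx.barabasi_albert_graph(5000, 25, seed=0)       # BA(5000, 25)
```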
In the main manuscript, we mentioned that the following claims follow from the reconstruction accuracies that we reported in Figure 3 in conjunction with the latent motifs in Figures 1, 2, 8, and 10. We now provide their justifications.
(1) The mesoscale structure of Caltech is rather different from those of Harvard, UCLA, and MIT at scale k = 21.
• In Figure 3c, we observe that the accuracy of the cross-reconstruction Y ← X is consistently higher for X ∈ {UCLA, Harvard, MIT} than for X = Caltech for all values of r. For instance, at r = 9, we can reconstruct UCLA with more than 90% accuracy, and we can reconstruct Harvard and MIT with more than 80% accuracy. However, the latent motifs that we learn from Caltech for r = 9 give only about 80% accuracy for reconstructing UCLA and only about 70% accuracy for reconstructing Harvard and MIT. This indicates that Caltech has a significantly different mesoscale structure than the other three universities' Facebook networks at scale k = 21. Indeed, from Figures 1, 2, 6, and 8, we see that the r = 25 latent motifs of Caltech at scale k = 21 have larger off-chain entries than those of UCLA, MIT, and Harvard. (2) The mesoscale structure of Caltech at scale k = 21 has a higher dimension than those of the other three universities' Facebook networks.
• The accuracy with r = 9 for the latent motifs that we learn from Caltech itself is as low as 64%. By contrast, it is 80% or higher for the self-reconstructions X ← X for the Facebook networks of the other universities. In other words, r = 9 latent motifs at scale k = 21 cannot approximate the mesoscale structure of Caltech as well as those of the other three universities' Facebook networks. This indicates that the dimension of the mesoscale structure of Caltech at scale k = 21 is larger than those of the other three universities' Facebook networks.
(3) The networks BA_1 and BA_2 are better than ER_1, ER_2, WS_1, and WS_2 at capturing the mesoscale structures of MIT, Harvard, and UCLA at scale k = 21.
• From the reconstruction accuracies for Y ← X in Figure 3b,c, where X is one of the six synthetic networks (ER_i, WS_i, and BA_i for i ∈ {1, 2}), we observe that the two BA networks have higher accuracies than the networks from the ER and WS models for Y ∈ {UCLA, Harvard, MIT}. This suggests that the mesoscale structures of UCLA, Harvard, and MIT may be more similar in some respects to those of BA_i than to those of ER_i and WS_i. The latent motifs of BA_1 in Figures 2 and 8 at k = 21 have characteristics that we also observe in UCLA, Harvard, and MIT. (Specifically, they have hub nodes and off-chain entries that are much smaller (so they are lighter in color) than the on-chain entries.) By contrast, in Figure 14, we see that the latent motifs for ER_1 have sparse but seemingly randomly distributed off-chain connections, and the ones for WS_1 have strongly interconnected communities of about 10 nodes (see the diagonal blocks of black entries). These patterns differ from the ones that we observe in the latent motifs of UCLA, MIT, and Harvard (see Figure 8). (4) If we uniformly sample a walk of k = 21 nodes, then it is more likely that there are communities with 10 or more nodes in the induced subnetwork for Caltech than is the case for UCLA, Harvard, and MIT.
• From the reconstruction accuracies for Y ← X in Figure 3b,c, where X is one of the six synthetic networks (ER_i, WS_i, and BA_i for i ∈ {1, 2}), we observe that the WS networks outperform both the BA and ER networks in reconstructing Caltech, but they are among the lowest-performing networks for reconstructing the Facebook networks of the other three universities. In other words, nonnegative linear combinations of the latent motifs of WS_i can better approximate the mesoscale patches of Caltech than they can those of UCLA, Harvard, and MIT. Recall that most latent motifs of WS_i at scale k = 21 have blocks of black entries of size 10 × 10 or larger (see Figure 10); these correspond to adjacency matrices of communities with 10 or more nodes. Therefore, such community structure should be more likely to occur in subnetworks that are induced from uniform samples of k = 21-node walks in Caltech than from such samples in UCLA, Harvard, or MIT.

We obtain the AUC scores in Figure 4 for the methods node2vec [11], DeepWalk [45], and LINE [50] for the task of denoising 50% subtractive noise for SNAP Facebook, H. sapiens, and arXiv from [11]. In [11], the ROCs were computed from a 'balanced test set' that includes all of the |E|/2 false non-edges and |E|/2 true non-edges that are chosen uniformly at random, where |E| denotes the number of edges in the network. By contrast, we compute our ROCs in Figure 4 using all of the |E|/2 false non-edges and all of the true non-edges, of which there are often very many. For instance, recall that arXiv has |V| = 18,722 nodes and |E| = 198,110 edges. Therefore, after deleting |E|/2 edges to create a corrupted network, there are |E|/2 = 99,055 false non-edges and |V|(|V| − 1)/2 − |E| = 175,049,171 true non-edges. Such an imbalance in a data set does not affect the ROCs and hence does not affect the AUCs. Consequently, the true-positive rates and false-positive rates do not change even if we compute them after independently sampling |E|/2 true non-edges. Therefore, our results are directly comparable to the AUCs in [11] after uniformly sampling a balanced test set that consists of |E|/2 true non-edges and |E|/2 false non-edges.

Appendix G. Convergence Analysis
In this section, we give rigorous convergence guarantees for our main algorithms for NDL and NDR. In [30, Corollary 6.1], Lyu et al. obtained a convergence guarantee for the original NDL algorithm ([30, Algorithm 1]) for non-bipartite networks G with the MCMC motif-sampling algorithms MCMC ∈ {Pivot, Glauber}. Theorem G.2(i) in the present paper establishes the same result for the NDL algorithm (see Algorithm 1) that uses mesoscale patches (2) with either the identity mask or the no-folding mask (4). Theorem G.2(ii) gives a similar convergence result for the NDL algorithm with the approximate pivot chain (i.e., for MCMC = PivotApprox). Theorem G.5 establishes a similar convergence result for NDL for bipartite networks G. Theorems G.7 and G.10 establish convergence and error bounds for the NDR algorithm (see Algorithm 2). Except for Theorem G.2(i), all of these theoretical results are novel results of the present paper.
Let F = ([k], A_F) be the k-chain motif, and let G = (V, A) be a network. Let Ω ⊆ V^[k] denote the set of all homomorphisms x : F → G. Algorithm 1 generates three stochastic sequences. The first one is the sequence (x_t)_{t≥0} of homomorphisms F → G that we generate using the pivot chain (see Algorithm MP). The second one is the sequence (X_t)_{t≥0} of k² × N data matrices whose columns encode N mesoscale patches of G. More precisely, for each y_1, . . ., y_N ∈ Ω, we write Ψ(y_1, . . ., y_N) for the k² × N matrix whose i-th column is the vectorization (using Algorithm A4) of the corresponding k × k mesoscale patch A_{y_i} of G. For each y_0 ∈ Ω, we define X^(N)(y_0) := Ψ(y_1, . . ., y_N), where we generate y_1, . . ., y_N using the pivot chain when it starts at y_0. It then follows that X_t = X^(N)(x_{Nt}) for each t ≥ 1, where x_{Nt} is the state of the pivot chain at time Nt. For the third (and final) sequence that we generate using Algorithm 1, let (W_t)_{t≥0} denote the sequence of dictionary matrices, where we define each W_t = W_t(x_0) via (12) with an initial homomorphism x_0 : F → G that we sample using Algorithm A3.

G.1. Convergence of the approximate pivot chain. We prove Proposition G.1, which states the convergence of the approximate pivot chain and gives an explicit formula for its unique stationary distribution.
Proposition G.1. Fix a network G = (V, A) and the k-chain motif F = ([k], A_F). Let (x_t)_{t≥0} denote a sequence of homomorphisms x_t : F → G that we generate using the approximate pivot chain (in which we use Algorithm MP with AcceptProb = Approximate). Suppose that (a) the weight matrix A is 'bidirectional' (i.e., A(a, b) > 0 implies that A(b, a) > 0 for all a, b ∈ V) and the undirected and binary graph (V, 1(A > 0)) is connected and non-bipartite. It then follows that (x_t)_{t≥0} is an irreducible and aperiodic Markov chain with the unique stationary distribution π̂_{F→G} that we defined in (11).
Proof. We follow the proof of [29, Thm. 5.8]. Let P : V² → [0, 1] be the matrix with entries P(a, b) = A(a, b) / Σ_{c∈V} A(a, c). This is the transition matrix of the standard random walk on the network G. By hypothesis (a), P is irreducible and aperiodic. Additionally, it has the unique stationary distribution π^(1)(a) = Σ_{c∈V} A(a, c) / Σ_{b,c∈V} A(b, c) (see [25, Ch. 9]). The approximate pivot chain generates a move x_t(1) → x_{t+1}(1) of the pivot according to the distribution P(x_t(1), ·). We accept this move of the pivot independently of everything else with the approximate acceptance probability α in (8). If we always accept each move of the pivot, then the pivot performs a random walk on G with the unique stationary distribution π^(1). We compute the acceptance probability α using the Metropolis-Hastings algorithm (see [25, Sec. 3.3]), and we thereby modify the stationary distribution of the pivot from π^(1) to the uniform distribution on V. (See the discussion in [29, Sec. 5].) Therefore, (x_t(1))_{t≥0} is an irreducible and aperiodic Markov chain on V that has the uniform distribution as its unique stationary distribution. Because we sample the locations x_{t+1}(i) ∈ V of the subsequent nodes i = 2, 3, . . ., k independently, conditional on the location x_{t+1}(1) of the pivot, it follows that the approximate pivot chain (x_t)_{t≥0} is also an irreducible and aperiodic Markov chain with a unique stationary distribution, which we denote by π̂_{F→G}.
To compute the stationary distribution π̂_{F→G}, we decompose the trajectory of x_t according to the return times of the pivot x_t(1) to a fixed node x_1 ∈ V of G. Specifically, let τ(j) be the j-th return time of x_t(1) to x_1. By the independence of the sampling of x_t over {2, . . ., k} for each t, the strong law of large numbers, and the Markov-chain ergodic theorem (see, e.g., [5, Theorem 6.2.1 and Example 6.2.4] or [38, Theorem 17.1.7]) applied to each fixed homomorphism x : F → G, we obtain the formula for π̂_{F→G} in (11). This proves the assertion.
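To illustrate the Metropolis-Hastings step in this proof, the following sketch (a simplified illustration, not the paper's Algorithm MP) performs one pivot move on an undirected binary network; for a uniform target distribution and random-walk proposals, the acceptance probability reduces to min{1, deg(current)/deg(proposal)}. Whether this coincides with the exact acceptance probability α in (8) is an assumption of the sketch.

```python
# One approximate-pivot move on an undirected binary network: propose a
# random-walk step for the pivot and accept with probability
# min(1, deg(current) / deg(proposal)), which converts the degree-proportional
# random-walk stationary distribution into the uniform distribution on V.
# (Resampling the remaining k - 1 chain nodes conditional on the pivot is
# omitted here.)
import random

def pivot_move(adj, pivot):
    # adj: dict mapping each node to the list of its neighbors
    proposal = random.choice(adj[pivot])
    accept_prob = min(1.0, len(adj[pivot]) / len(adj[proposal]))
    return proposal if random.random() < accept_prob else pivot
```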
G.2. Convergence of the NDL algorithm. Recall the problem statement for NDL in (6). Informally, we seek to learn r latent motifs L_1, . . ., L_r ∈ R^{k×k}_{≥0} that minimize the expected error of approximating the mesoscale patch A_x by a nonnegative combination of the motifs L_i, where x : F → G is a random homomorphism that we sample from the distribution π_{F→G} in (1). We reformulate this problem as the following matrix-factorization problem, which generalizes (6). Let C_dict denote the set of all matrices W ∈ R^{k²×r}_{≥0} whose columns have a Frobenius norm of at most 1. The matrix-factorization problem is then

arg min_{W ∈ C_dict} f(W) := E_{x∼π_{F→G}} [ ℓ(X^(N)(x), W) ] ,    (14)

where we define the loss function

ℓ(X, W) := min_{H ∈ R^{r×N}_{≥0}} ( ‖X − W H‖_F² + λ ‖H‖_1 ) .    (15)

The parameters N ∈ N and λ ≥ 0 appear in Algorithm 1. The former is the number of homomorphisms that we sample at each iteration of Algorithm 1, and the latter is the coefficient of an L1-regularizer that we use to find the code matrix H_t in (12). In the special case of N = 1 and λ = 0, the problem (14) is equivalent to the problem (6), because X^(1)(x) and the columns of W are vectorizations (using Algorithm A4) of the mesoscale patch A_x and the latent motifs L_1, . . ., L_r, respectively. Theorems G.2 and G.5 imply that our NDL algorithm (see Algorithm 1) finds a sequence (W_t)_{t≥0} of dictionary matrices such that, almost surely, W_t is asymptotically a stationary point of the objective function f in the optimization problem (14). The objective function f is non-convex, so it is generally difficult to find a global optimum of f. In practice, however, such stationary points have often been good enough for practical applications, such as image restoration [7,35]. We find that this is also the case for our network-denoising problem (see Figure 4).
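To make the loss in (15) concrete, the following sketch evaluates ℓ(X, W) for N = 1. The NNLS solution is exact only for λ = 0; for λ > 0 the snippet merely evaluates the regularized objective at that solution, which upper-bounds ℓ(X, W), so it is an illustration rather than a substitute for Algorithm A1.

```python
# A hedged sketch of the per-patch loss in (15) with N = 1: X is a vectorized
# k x k patch (length k^2) and W is a k^2 x r dictionary matrix.
import numpy as np
from scipy.optimize import nnls

def patch_loss(X, W, lam=0.0):
    # H below minimizes ||X - W H||^2 over H >= 0 (exact for lam = 0); for
    # lam > 0 we evaluate the regularized objective at this H, which gives an
    # upper bound on l(X, W) rather than its exact value.
    H, _ = nnls(W, X)
    return float(np.sum((X - W @ H) ** 2) + lam * np.sum(np.abs(H)))
```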
We are now ready to state our first convergence result for the NDL algorithm (see Algorithm 1).
Theorem G.2 (Convergence of the NDL Algorithm for Non-Bipartite Networks). Let F = ([k], A_F) be the k-chain motif, and let G = (V, A) be a network that satisfies the following properties: (a) The weight matrix A is 'bidirectional' (i.e., A(a, b) > 0 implies that A(b, a) > 0 for all a, b ∈ V) and the undirected and binary graph (V, 1(A > 0)) is connected and non-bipartite. (b) For all t ≥ 0, there exists a unique solution H_t in (12). (c) For all t ≥ 0, the eigenvalues of the positive semidefinite matrix A_t that is defined in (12) are at least as large as some constant κ_1 > 0. Let (W_t)_{t≥0} denote the sequence of dictionary matrices that we generate using Algorithm 1. The following claims hold: (i) For MCMC ∈ {Pivot, Glauber}, we have almost surely as t → ∞ that W_t converges to the set of stationary points of the objective function f that we defined in (14). Furthermore, if f has finitely many stationary points in C_dict, then W_t converges to a single stationary point of f almost surely as t → ∞. (ii) For MCMC = PivotApprox, we have almost surely as t → ∞ that W_t converges to the set of stationary points of the objective function f̂ that we obtain from (14) by replacing π_{F→G} with the distribution π̂_{F→G} that we defined in (11). Furthermore, if f̂ has finitely many stationary points in C_dict, then W_t converges to a single stationary point of f̂ almost surely as t → ∞.
Remark G.3. Assumptions (a)-(c) in Theorem G.2 are all reasonable and are easy to satisfy. Assumption (a) is satisfied if G is undirected, binary, and connected, which is the case for all of our examples in the present paper. Assumptions (b) and (c) are standard assumptions in the study of online dictionary learning [30,31,32]. For instance, (b) is a common assumption in methods such as least-angle regression (LARS) [6] that aim to find good solutions to problems of the form (15). Additionally, in practice, one can verify (c) experimentally after a few iterations of Algorithm 1 for a reasonable choice of the initial dictionary (e.g., r samples of mesoscale patches). See [32, Sec. 4.1] and [30, Sec. 4.1] for more detailed discussions of these assumptions.
Remark G.4. It is also possible to slightly modify both the optimization problem (14) and our NDL algorithm so that Theorem G.2 holds for the modified problem and algorithm without needing to assume (b) and (c). The modified problem is

arg min_{W ∈ C_dict} E_{x∼π} [ min_{H ∈ R^{r×N}_{≥0}} ( ‖X^(N)(x) − W H‖_F² + λ ‖H‖_1 + λ′ ‖H‖_F² + κ′_1 ‖W‖_F² ) ] ,    (16)

where π = π̂_{F→G} if MCMC = PivotApprox and π = π_{F→G} otherwise. Note that (16) is the same as (14) with additional quadratic penalization terms for both H and W in the loss function that we defined in (15). Correspondingly, consider the modification of the NDL algorithm (see Algorithm 1) in which the objective function for H_t in (12) has the additional term λ′ ‖H‖_F² and we replace P_t in (12) by P_t + κ′_1 I. Assuming that λ′, κ′_1 > 0, the modified objective function for H_t is strictly convex, so H_t is uniquely defined. Therefore, it satisfies the uniqueness condition (b) in Theorem G.2. Additionally, the smallest eigenvalue of each matrix P_t that we compute using the modified NDL algorithm has a lower bound of κ′_1 for all t, so it satisfies condition (c) in Theorem G.2. One can then show that the same statement as Theorem G.2 holds for the modified problem (16) and the modified NDL algorithm without assumptions (b) and (c). The argument, which we omit, is almost identical to the one for Theorem G.2.
Proof of Theorem G.2. The proof of the first part of (i) is identical to the proof of [30, Corollary 6.1]. For the first part of (ii), we can use the same essential argument as in the proof of [30, Corollary 6.1]. However, because our assertion is for the approximate pivot chain that we propose in the present article (see Algorithm MP with AcceptProb = Approximate), we need to use Proposition G.1 (instead of [29, Prop. 5.8]) to establish the irreducibility and convergence of our Markov chain. The proofs of the second parts of (i) and (ii) are identical.
We first observe that the matrices X_t ∈ R^{k²×N}_{≥0} that we compute in line 15 of Algorithm 1 do not necessarily form a Markov chain, because the forward evolution depends both on the induced mesoscale patches and on the actual homomorphisms (x_s)_{N(t−1)<s≤Nt}. However, if one considers the sequence X̄_t := (X_t, x_{Nt}), then X̄_t forms a Markov chain. Specifically, the distribution of X̄_{t+1} given X̄_t depends only on x_{Nt} and A. Indeed, x_{Nt} and A determine the distribution of the homomorphisms (x_s)_{Nt<s≤N(t+1)}, which in turn determine the k² × N matrix X_{t+1}.
With assumption (a), [29, Theorems 5.7 and 5.8] and Proposition G.1 imply that the Markov chain (x_t)_{t≥0} of homomorphisms F → G is a finite-state Markov chain that is irreducible and aperiodic with a unique stationary distribution π (see (1)). This implies that the N-tuple of homomorphisms (x_s)_{N(t−1)<s≤Nt} also yields a finite-state, irreducible, and aperiodic chain with a unique stationary distribution. Consequently, the Markov chain (X̄_t)_{t≥0} that we defined above is also a finite-state, irreducible, and aperiodic chain with a unique stationary distribution. In this setting, one can regard Algorithm 1 as the online NMF algorithm in [30] for the input sequence X_t = ϕ(X̄_t), where ϕ(X, x) = X is the projection onto the first coordinate. Because X̄_t takes only finitely many values, the range of ϕ is bounded. This verifies all hypotheses of [30, Theorem 4.1], so the first part of (ii) follows.
Now suppose that there are finitely many stationary points W*_1, . . ., W*_m of the objective function in C_dict. Because ‖W_{t+1} − W_t‖_F → 0 almost surely (see [30, Prop. 7.5]), it follows from the first part of (ii) (which we proved above) that W_t converges to W*_i for some unique i ∈ {1, . . ., m} almost surely.

Our second convergence result for the NDL algorithm is similar to Theorem G.2, but it concerns the case in which the network G is bipartite. Suppose that G is bipartite. In this case, neither the pivot chain nor the Glauber chain is irreducible as a Markov chain on the state space Ω. However, there exist disjoint subsets Ω_1 and Ω_2 with Ω = Ω_1 ∪ Ω_2 such that each chain is irreducible when we restrict it to each Ω_i, with the unique stationary distribution

π^(i)_{F→G}(x) := (1/Z_i) ∏_{j=1}^{k−1} A(x(j), x(j+1)) 1(x ∈ Ω_i) ,    (17)

where Z_i = Σ_{x∈Ω_i} ∏_{j=1}^{k−1} A(x(j), x(j+1)) and we defined the homomorphism density Z just below (1). We define two associated conditional expected loss functions f^(i) (with i ∈ {1, 2}; see (18)) by replacing π_{F→G} in (14) with π^(i)_{F→G}; the equality in (18) follows from the first equality in (17). Because the Markov chain (x_t)_{t≥0} stays in Ω_i if it is initialized in Ω_i, the conditional expected loss functions f^(i) are the natural objective functions to minimize (instead of the full expected loss function f that serves as the objective function in (14)). Similarly, for the approximate pivot chain, we define the distributions π̂^(i)_{F→G} and the conditional expected loss functions f̂^(i) analogously (see (19)), with π̂_{F→G} as in (11).

We now state our second convergence result for the NDL algorithm; this result is for bipartite networks. A convergence result that is analogous to the one in Theorem G.2 holds for the associated conditional expected loss functions. This implies that one can initialize homomorphisms in each Ω_i and compute sequences W^(i)_t (with i ∈ {1, 2}) of dictionary matrices to learn stationary points of both associated conditional expected loss functions f^(i). However, when the chain motif F = ([k], A_F) has an even number of nodes, one only needs to compute one sequence of dictionary matrices, because one can obtain the other sequence by the algebraic operation of taking a 'mirror image' of a given square matrix. More precisely, define a map Flip : R^{k²×r} → R^{k²×r} that maps W ↦ W̃, such that the j-th column W̃(:, j) of W̃ is given by

W̃(:, j) := vec ∘ rev ∘ reshape(W(:, j)) ,    j ∈ {1, . . ., r} ,

where W(:, j) denotes the j-th column of W, reshape : R^{k²} → R^{k×k} is the reshaping operator that we defined in Algorithm A5, rev maps a k × k matrix K to the k × k matrix K̃ = (K̃_{ab})_{1≤a,b≤k} with entries K̃_{ab} = K(k − a + 1, k − b + 1), and vec denotes the vectorization operator in Algorithm A4. Applying Flip twice gives the identity map.
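The Flip operation is straightforward to implement. The following sketch assumes a row-major (C-order) vectorization, whereas Algorithms A4 and A5 fix a particular order that may differ.

```python
# A small sketch of Flip: reshape each column of W into a k x k matrix,
# reverse both the row and the column order, and vectorize again.
import numpy as np

def flip_dictionary(W, k):
    W_flip = np.empty_like(W)
    for j in range(W.shape[1]):
        K = W[:, j].reshape(k, k)         # reshape (cf. Algorithm A5)
        K_rev = K[::-1, ::-1]             # rev: K(a, b) -> K(k - a + 1, k - b + 1)
        W_flip[:, j] = K_rev.reshape(-1)  # vectorize (cf. Algorithm A4)
    return W_flip

# Applying flip_dictionary twice recovers W, so Flip is an involution.
```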
Theorem G.5 (Convergence of the NDL Algorithm for Bipartite Networks). Let F = ([k], A_F) be the k-chain motif, and let G = (V, A) be a network that satisfies the following properties: (a′) A is symmetric and the undirected and binary graph (V, 1(A > 0)) is connected and bipartite. (b) For all t ≥ 0, there exists a unique solution H_t in (12).
(c) For all t ≥ 0, the eigenvalues of the positive semidefinite matrix A_t in (12) are at least as large as some constant κ_1 > 0. Let (W_t)_{t≥0} denote the sequence of dictionary matrices that we generate using Algorithm 1. The following properties then hold: (i) Suppose that MCMC ∈ {Pivot, Glauber}. For each i ∈ {1, 2}, conditional on x_0 ∈ Ω_i, the sequence W_t of dictionary matrices converges almost surely as t → ∞ to the set of stationary points of the associated conditional expected loss function f^(i) that we defined in (18). If MCMC = PivotApprox, then the same statement holds with f^(i) replaced by the function f̂^(i) that we defined in (19). If, additionally, f^(i) (respectively, f̂^(i)), with i ∈ {1, 2}, has only finitely many stationary points in C_dict, then W_t converges to a single stationary point of f^(i) (respectively, f̂^(i)) almost surely as t → ∞. (ii) Suppose that MCMC = Glauber in Algorithm 1 and that k is even. Assume that x_0 ∈ Ω_1. It then follows that, almost surely, the sequences of dictionary matrices W_t and W̃_t = Flip(W_t) converge simultaneously to the sets of stationary points of the conditional expected loss functions f^(1) and f^(2), respectively. Moreover, f^(1)(W_t) = f^(2)(W̃_t) for all t ≥ 0. If, additionally, f^(i) (with i ∈ {1, 2}) has only finitely many stationary points in C_dict, then both W_t and W̃_t converge to single stationary points of f^(1) and f^(2), respectively, almost surely as t → ∞.
Proof. We first prove (i). Fix i ∈ {1, 2}, and recall the conditional stationary distribution π^(i)_{F→G} from (17). Conditional on x_0 ∈ Ω_i, the Markov chain (x_t)_{t≥0} is irreducible and aperiodic with the unique stationary distribution π^(i)_{F→G}. Recall that the conclusion of Theorem G.2 holds as long as the underlying Markov chain is irreducible. Therefore, W_t converges almost surely to the set of stationary points of the associated conditional expected loss function f^(i) that we defined in (18). The same argument verifies the case in which MCMC = PivotApprox.
We now verify (ii). Define the notation µ_i := π^(i)_{F→G}, and suppose that k is even. For each homomorphism x : F → G, we define a map x̃ : [k] → V by x̃(j) := x(k − j + 1) for all j ∈ {1, . . ., k}.
For even k, we have that x ∈ Ω_1 if and only if x̃ ∈ Ω_2. Because A is symmetric, it follows that ∏_{j=1}^{k−1} A(x(j), x(j+1)) = ∏_{j=1}^{k−1} A(x̃(j), x̃(j+1)).
Therefore, Z_1 = Z_2 = Z/2. Consequently, for each x ∈ Ω_1, (17) implies that

µ_1(x) = µ_2(x̃) .    (20)

Consider two Glauber chains, (x_t)_{t≥0} and (x′_t)_{t≥0}, where x_0 = y and x′_0 = ỹ. We evolve these two Markov chains using a common source of randomness so that, individually, they have Glauber-chain trajectories but they also satisfy the relation

x′_t = x̃_t for all t ≥ 0 .    (21)

(This is typically called a 'coupling argument' in the probability literature; see [25, Sec. 4.2].) Specifically, suppose that x′_t = x̃_t. For each update x_t → x_{t+1} and x′_t → x′_{t+1}, we choose a node v ∈ [k] uniformly at random and sample z ∈ V according to the conditional distribution (10). We define x_{t+1} by resampling the location of node v to z (i.e., x_{t+1}(v) := z and x_{t+1}(u) := x_t(u) for u ≠ v), and we define x′_{t+1} by resampling the location of node k − v + 1 to z. Then x_t → x_{t+1} follows the Glauber-chain update in Algorithm MG, and we have the desired relation x′_{t+1} = x̃_{t+1}. Finally, we need to verify that x′_t → x′_{t+1} also follows the Glauber-chain update in Algorithm MG. It suffices to check that z ∈ V has the same distribution as the conditional distribution (10) of x′_{t+1}(k − v + 1) given x′_t. Because x′_t = x̃_t, this is indeed the case.
For the two Glauber chains, x_t and x′_t, we observe that A_{x′_t} = rev(A_{x_t}) for all t ≥ 0 almost surely. This result follows from the facts that x′_t = x̃_t for all t ≥ 0 and rev(A_x) = A_{x̃} for all x ∈ Ω.
(See (2) for the definition of A_x.) From this, we note that, for every dictionary matrix W ∈ C_dict,

f^(1)(W) = f^(2)(Flip(W)) .    (22)

The first and the last equalities in the derivation of (22) use the second equality in (17). The second equality uses the fact that ℓ(X, W) = ℓ(X̃, W̃), where X̃ and W̃ denote the correspondingly 'flipped' data and dictionary matrices. The third equality follows from (21). The fourth equality follows from the change of variables x ↦ x̃ and the fact that x ∼ µ_1 if and only if x̃ ∼ µ_2 (see (20)). We now prove (ii). Its first part follows immediately from (i) and the above construction of the Glauber chains x_t and x′_t that satisfy x′_t = x̃_t for all t ≥ 0. Specifically, let W_t = W_t(x_0) and W′_t = W′_t(x′_0) denote the sequences of dictionary matrices that we compute using Algorithm 1 with initial homomorphisms x_0 and x′_0, respectively. Suppose that x_0 ∈ Ω_1, from which we see that x′_0 = x̃_0 ∈ Ω_2. By (i), W_t and W′_t converge almost surely to the sets of stationary points of the associated conditional expected loss functions f^(1) and f^(2), respectively. We complete the proof of the first part of (ii) by observing that, almost surely,

W′_t = Flip(W_t) for all t ≥ 0 .    (23)

The second part of (ii) follows immediately from (22). We still need to verify (23). Roughly, the argument is that all k × k mesoscale patches A_{x′_t} = rev(A_{x_t}) have the reversed row and column ordering relative to the original ordering, so the k × k latent motifs that we train on such matrices also have the reversed ordering of rows and columns. More concretely, one can check this claim by induction on t together with (21) and the uniqueness assumption (b). We omit the details.
Remark G.6. Our proofs of Theorems G.2 and G.5 do not depend on the particular choice of mask Φ_{F,x} that we use to define the mesoscale patches A_x in (2).

G.3. Convergence of the NDR algorithm. We prove convergence results for the NDR algorithm (see Algorithm 2) in Theorem G.7. Specifically, we show that the reconstructed network that we obtain using Algorithm 2 at iteration t converges almost surely to some limiting network as t → ∞, and we give a closed-form expression for the limiting network. We also give a bound for the 'distance' between the original network and the limiting reconstructed network in terms of a 'fitness' of the network dictionary that we use in the reconstruction algorithm. We measure this fitness using the expected 1-norm between a sampled k × k mesoscale patch and its best nonnegative linear approximation using our network dictionary.
Given a network G′ = (V, B) and a homomorphism x : F → G, let B_x denote the k × k matrix with entries B_x(a, b) := B(x(a), x(b)) Φ_{F,x}(a, b). If B = A, then B_x = A_x equals the mesoscale patch of G that is induced by x (see (2)). Additionally, given a network G = (V, A), a motif F = ([k], A_F), a homomorphism x : F → G, and a nonnegative matrix W ∈ R^{k²×r}_{≥0}, let Â_{x;W} denote the k × k matrix that we defined in line 11 of Algorithm 2. This matrix depends on the Boolean variable denoising. Recall that Â_{x;W} is a nonnegative linear approximation of A_x that uses W. We introduce the event '(p, q) ←_x (a, b)' using the following indicator function:

1[(p, q) ←_x (a, b)] := 1(x(a) = p, x(b) = q) Φ_{F,x}(a, b) ,    (24)

where Φ_{F,x} is the no-folding mask that we defined in (4). For each homomorphism x : F → G and p, q ∈ V, we say that the pair (p, q) is visited by (a, b) through x whenever the indicator on the left-hand side of (24) is 1. Additionally,

N_{pq}(x) := Σ_{a,b∈{1,...,k}} 1[(p, q) ←_x (a, b)]    (25)

is the total number of visits to (p, q) through x. When N_{pq}(x) > 0, we say that the pair (p, q) is visited by x. In Algorithm 2, observe that both A_count(p, q) and A_recons(p, q) change at iteration t if and only if N_{pq}(x_t) > 0. Finally,

Ω_{pq} := {x : F → G | N_{pq}(x) > 0}    (26)

is the set of all homomorphisms x : F → G that visit the pair (p, q).

Theorem G.7 (Convergence of the NDR Algorithm (see Algorithm 2) for Non-Bipartite Networks). Let F = ([k], A_F) be the k-chain motif, and fix a network G = (V, A) and a network dictionary W ∈ R^{k²×r}_{≥0}. We use Algorithm 2 with inputs G, F, and W and the parameter value T = ∞. Let Ĝ_t = (V, Â_t) denote the network that we reconstruct at iteration t, and suppose that G satisfies assumption (a) of Theorem G.2. Let π = π̂_{F→G} if MCMC = PivotApprox and π = π_{F→G} otherwise. The following statements hold: (i) The network Ĝ_t converges almost surely to some limiting network Ĝ_∞ = (V, Â_∞) in the sense that lim_{t→∞} Â_t(p, q) = Â_∞(p, q) ∈ [0, ∞) almost surely for all p, q ∈ V. (ii) Let Â_∞ denote the limiting matrix in (i). For each p, q ∈ V with E_{x∼π}[N_{pq}(x)] > 0, we then have that

Â_∞(p, q) = E_{x∼π}[ Σ_{a,b∈{1,...,k}} 1[(p, q) ←_x (a, b)] Â_{x;W}(a, b) ] / E_{x∼π}[ N_{pq}(x) ] .    (27)

(iii) Let Â_∞ be as in (ii). For any network G′ = (V, B), we have

Σ_{p,q∈V} |Â_∞(p, q) − B(p, q)| E_{x∼π}[N_{pq}(x)] ≤ E_{x∼π}[ Σ_{a,b∈{1,...,k}} |Â_{x;W}(a, b) − B_x(a, b)| ] .    (28)

(iv) Let Â_∞ be as in (ii), and suppose that denoising = F. It then follows that, for every δ > 0,

Σ_{p,q∈V} |Â_∞(p, q) − A(p, q)| E_{x∼π}[N_{pq}(x)] ≤ k ( δ^{1/2} + M^{1/2} δ^{−1} E_{x∼π}[ℓ(vec(A_x), W)] ) ,    (29)

where ℓ is the loss function in (15), vec is the vectorization operator in Algorithm A4, and M := sup_{x:F→G} ℓ(vec(A_x), W).

Proof. Let x denote a random homomorphism F → G with distribution π, and let P and E denote the associated probability measure and expectation, respectively. We first verify (i) and (ii) simultaneously. Let (x_t)_{t≥0} denote the Markov chain that we generate during the reconstruction process (see Algorithm 2), and fix p, q ∈ V. The key observation is that

Â_t(p, q) = ( Σ_{s≤t} Σ_{a,b∈{1,...,k}} 1[(p, q) ←_{x_s} (a, b)] Â_{x_s;W}(a, b) ) / ( Σ_{s≤t} N_{pq}(x_s) )    (30)

whenever the denominator is positive; that is, Â_t(p, q) is the running mean of the contributions of all visits to (p, q) up to iteration t, where we defined the indicator 1[(p, q) ←_x (a, b)] in (24). With assumption (a), the Markov chain (x_t)_{t≥0} of homomorphisms F → G is irreducible and aperiodic with π (see (1)) as its unique stationary distribution, so the Markov-chain ergodic theorem applied to (30) yields (27). This proves both (i) and (ii).

To verify (iii), we multiply both sides of (27) by E_{x∼π}[N_{pq}(x)], note that B(p, q) E_{x∼π}[N_{pq}(x)] = E_{x∼π}[Σ_{a,b} 1[(p, q) ←_x (a, b)] B_x(a, b)], and apply the triangle inequality before summing over p, q ∈ V. This verifies (iii).
Finally, we prove (iv). First, by (iii) with B = A and the Cauchy-Schwarz inequality,

Σ_{p,q∈V} |Â_∞(p, q) − A(p, q)| E_{x∼π}[N_{pq}(x)] ≤ k E_{x∼π}[ ℓ(vec(A_x), W)^{1/2} ] ,

where ℓ denotes the loss function that we defined in (15) and vec denotes the vectorization operator in Algorithm A4. By Markov's inequality, we have for each δ > 0 that P(ℓ(vec(A_x), W) ≥ δ) ≤ δ^{−1} E_{x∼π}[ℓ(vec(A_x), W)]. We define the notation M := sup_{x:F→G} ℓ(vec(A_x), W), which is finite because ℓ(vec(A_x), W) ≤ ‖A_x‖_F² and there are only finitely many homomorphisms x : F → G. Therefore, E_{x∼π}[ℓ^{1/2}] ≤ δ^{1/2} + M^{1/2} P(ℓ ≥ δ) ≤ δ^{1/2} + M^{1/2} δ^{−1} E_{x∼π}[ℓ], which yields (29).
Remark G.8. Suppose that we take G′ = G in Theorem G.7(iii) and that denoising = F, as in Theorem G.7(iv). The left-hand side of (28) is then a measure of the difference between the original network G = (V, A) and the limiting reconstructed network Ĝ_∞ = (V, Â_∞) that we compute using the NDR algorithm (see Algorithm 2) with a network dictionary W ∈ R^{k²×r}_{≥0}. Recall that the columns of W encode the r latent motifs L_1, . . ., L_r ∈ R^{k×k}_{≥0} (see Section B.4). According to (28), we have G = Ĝ_∞ if the right-hand side of (28) is 0. This is the case if sup_{x:F→G} ℓ(vec(A_x), W) = 0, which means that W can perfectly approximate all mesoscale patches A_x of G. However, the right-hand side of (28) can still be small if the worst-case approximation error sup_{x:F→G} ℓ(vec(A_x), W) is large but the expected approximation error E_{x∼π}[ℓ(vec(A_x), W)] is small (i.e., when W is effective at approximating most of the mesoscale patches).
How can we find a network dictionary W that minimizes the right-hand side of (28)? Although it is difficult to find a globally optimal network dictionary W that minimizes the non-convex objective function on the right-hand side of (28), Theorems G.2 and G.5 guarantee that our NDL algorithm (see Algorithm 1) always finds a locally optimal network dictionary. Indeed, from these theorems, the NDL algorithm with N = 1 computes a network dictionary W that is approximately a local optimum of the expected loss function

f(W) := E_{x∼π} [ ℓ(vec(A_x), W) ] ,    (33)

where π = π̂_{F→G} if MCMC = PivotApprox and π = π_{F→G} otherwise. The function f in (33) appears in the upper bound in (29). In our experiments, we find that our NDL algorithm produces network dictionaries that are effective at minimizing the reconstruction error on the left-hand side of (28); see, e.g., Figure 3.

Another implication of Theorem G.7(iii) is also relevant to network denoising (see Figure 4). Suppose that we have an uncorrupted network G′ = (V, B) and a corrupted network G = (V, A). Additionally, suppose that we have trained the network dictionary W on the uncorrupted network G′, but that we use it to reconstruct the corrupted network G. Even though Â_{x;W} is a nonnegative linear approximation of the k × k matrix A_x of a mesoscale patch of the corrupted network G, it may be close to the corresponding mesoscale patch B_x of the uncorrupted network G′, because we used the network dictionary W that we learned from the uncorrupted network G′. Theorem G.7(iii) then guarantees that the network Ĝ_∞ that we reconstruct for the corrupted network G using the uncorrupted-network dictionary W is close to the uncorrupted network G′.
Remark G.9. The update step (see line 17) for global reconstruction in Algorithm 2 indicates that we loop over all node pairs (a, b) in the k-chain motif and that we update the weight of the edge (x_t(a), x_t(b)) in the reconstructed network using the homomorphism x_t : F → G. There may be multiple node pairs (a, b) in F that contribute to the edge (p, q) in the reconstructed network, because x_t(a) = p and x_t(b) = q can occur for multiple choices of (a, b). The output of this update step does not depend on the ordering of a, b ∈ {1, . . ., k}, as one can see from the expressions in (30).
One can also consider the following alternative update step for global reconstruction. We first choose two nodes p, q of the reconstructed network in the image {x_t(j) | j ∈ {1, . . ., k}} of the homomorphism x_t, we average over all pairs (a, b) ∈ [k]² such that (p, q) is visited by (a, b) through x_t, and we then update the weight of (p, q) in the reconstructed network with this mean contribution from x_t. Specifically, for each a, b ∈ {1, . . ., k}, let 1[(p, q) ←_{x_t} (a, b)] denote the indicator that we defined in (24), and let N_{pq}(x_t) ≥ 0 be the number of visits of x_t to (p, q) (see (25)). We can then replace line 17 in Algorithm 2 with the following line:

Alternative update for global reconstruction: For p, q ∈ V such that N_{pq}(x_t) > 0, update A_recons(p, q) with the running mean of the mean contributions N_{pq}(x_t)^{−1} Σ_{a,b∈{1,...,k}} 1[(p, q) ←_{x_t} (a, b)] Â_{x_t;W}(a, b).

For the alternative NDR algorithm that we just described, we can establish a convergence result that is similar to Theorem G.7 using an argument that is similar to the one in our proof of Theorem G.7. Specifically, (i) holds for the alternative NDR algorithm, so there exists a limiting reconstructed network. In the proof of (ii), the formula for the limiting reconstructed network is now the expectation, with respect to π conditioned on the event x ∈ Ω_{pq}, of the mean contribution N_{pq}(x)^{−1} Σ_{a,b∈{1,...,k}} 1[(p, q) ←_x (a, b)] Â_{x;W}(a, b), where Ω_{pq} is the set of all homomorphisms that visit (p, q) (see (26)). In particular, if G is an undirected and binary graph, then Â_∞(p, q) = |Ω_{pq}|^{−1} Σ_{y∈Ω_{pq}} Â_{y;W}(p, q) for all p, q ∈ V.
In the proofs of (iii) and (iv), the same error bounds hold with E x∼π [N pq (x)] replaced by P x∼π (x ∈ Ω pq ).We omit the details of the proofs of the above statements for this alternative NDR algorithm.
We now discuss convergence results for Algorithm 2 for a bipartite network G. Recall the notation and discussion about bipartite networks above Theorem G.5. Additionally, recall for our bipartite networks that there exist disjoint subsets Ω_1 and Ω_2 of the set Ω of all homomorphisms F → G such that (1) Ω = Ω_1 ∪ Ω_2 and (2) the Markov chain (x_t)_{t≥0} restricted to each Ω_i (with i ∈ {1, 2}) is irreducible, but it is not irreducible on the full set Ω.
Theorem G.10 (Convergence of the NDR Algorithm (see Algorithm 2) for Bipartite Networks). Let F = ([k], A_F) be the k-chain motif, and let G = (V, A) be a network that satisfies assumption (a′) in Theorem G.5. Let Ĝ_t = (V, Â_t) denote the network that we reconstruct using Algorithm 2 at iteration t with a fixed network dictionary W ∈ R^{k²×r}_{≥0}. Fix i ∈ {1, 2} and an initial homomorphism x_0 ∈ Ω_i. Let π = π̂_{F→G} if MCMC = PivotApprox and π = π_{F→G} otherwise. The following properties hold: (i) The network Ĝ_t converges almost surely to some limiting network Ĝ_∞ = (V, Â_∞) in the sense that lim_{t→∞} Â_t(p, q) = Â_∞(p, q) almost surely for all p, q ∈ V. Statements (ii)-(v) are bipartite analogues, conditional on the initialization x_0 ∈ Ω_i, of the corresponding properties in Theorem G.7.

Proof. The proofs of statements (i)-(iv) are identical to those of Theorem G.7. Statement (v) follows from an argument that is similar to the one in the proof of Theorem G.5(ii): one constructs coupled Markov chains (x_t)_{t≥0} and (x′_t)_{t≥0} such that x′_t = x̃_t for all t ≥ 0.
Remark G.11. Our proofs of Theorems G.7 and G.10 do not depend on the particular choice of mask Φ_{F,x} that we use to define the mesoscale patches A_x in (2).
Remark G.12. In Theorem G.10(ii), let Ĝ^(i)_∞ = (V, Â^(i)_∞) denote the limiting reconstructed network for G conditional on the Markov chain being initialized in Ω_i for i ∈ {1, 2}. When k is even, Theorem G.10(iv) implies that Ĝ^(1)_∞ = Ĝ^(2)_∞. When k is odd, we run the NDR algorithm (see Algorithm 2) twice, with the Markov chain initialized in Ω_1 and in Ω_2. We then define the network Ĝ_∞ := (V, (Â^(1)_∞ + Â^(2)_∞)/2), whose weight matrix is the mean of those of the two limiting reconstructed networks Ĝ^(i)_∞ for i ∈ {1, 2}. We obtain an error bound that is similar to the one in Theorem G.10(iii) for this mean limiting reconstructed network. In practice, one can obtain a sequence of reconstructed networks that converges to the mean reconstructed network Ĝ_∞ by reinitializing the Markov chain every τ iterations of the reconstruction procedure for any fixed τ.

Appendix H. Auxiliary Algorithms
We now present auxiliary algorithms that we use to solve subproblems of Algorithms 1 and 2. Let Π_S denote the projection operator onto a subset S of the ambient space. For each matrix A, let [A]_{•i} (respectively, [A]_{i•}) denote the i-th column (respectively, i-th row) of A.
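As one concrete example of such a projection, the following sketch projects a matrix onto C_dict by clipping negative entries and rescaling any column whose norm exceeds 1; this is a minimal illustration under these assumptions, not the paper's Algorithm A1 or A2.

```python
# Projection onto C_dict: keep W nonnegative and rescale each column whose
# Euclidean norm (the Frobenius norm of the corresponding latent motif)
# exceeds 1 back onto the unit ball.
import numpy as np

def project_onto_C_dict(W):
    W_proj = np.maximum(W, 0.0)               # enforce nonnegativity
    norms = np.linalg.norm(W_proj, axis=0)    # column norms
    large = norms > 1.0
    W_proj[:, large] = W_proj[:, large] / norms[large]
    return W_proj
```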

Figure 1 .
Figure 1. Illustration of mesoscale structures that we learn from (a) images and (b,c) networks. In all experiments in this figure, we form a matrix X of size d × n by sampling n mesoscale patches of size d = 21 × 21 from the corresponding object. For the image in panel (a), the columns of X are square patches of 21 × 21 pixels. In panels (b) and (c), we show both heat maps and adjacency matrices. We take the columns of X to be the k × k adjacency matrices of the connected subgraphs that are induced by a walk of k = 21 nodes, where a walk of k nodes consists of k nodes x_1, . . ., x_k such that x_i and x_{i+1} are adjacent for all i ∈ {1, . . ., k − 1}. Using nonnegative matrix factorization (NMF), we compute an approximate factorization X ≈ W H into nonnegative matrices, where W has r = 25 columns. Because of this factorization, we can approximate any sampled mesoscale patches (i.e., the columns of X) of an object by a nonnegative linear combination of the columns of W, which we can interpret as latent shapes for images and latent motifs (i.e., subgraphs) for networks, respectively. The network dictionaries of latent motifs that we learn from the (b) UCLA and (c) Caltech Facebook networks reveal distinctive social structures. For example, if we uniformly sample a chain of 21 friends in one of these networks, we observe for Caltech that there are likely to be communities with six or more nodes and also some 'hub' users who know most of the others in the sample. However, for UCLA, it is unlikely that there are such communities or hubs. In the heat map of the UCLA network, we show only the first 3000 nodes according to the node labeling in the data set.

Figure 2 .
Figure 2. Latent motifs that we learn from 14 networks (eight real-world networks and six synthetic networks, which include two distinct instantiations from each of three random-graph models) at five different scales (specifically, for k = 6, 11, 21, 51, 101), which reveal distinct mesoscale structures in the networks. Using network dictionary learning (NDL), we learn network dictionaries of r = 25 latent motifs of k nodes for each of the 14 networks. For each network at each scale, we show only the second-most dominant latent motif from each dictionary. These motifs include more information than the most dominant motifs for these sparse networks. Black squares represent 1 entries and white squares represent 0 entries. See Section D.2 in our SI for details of how we measure latent motif dominance, and see Figure 6 in our SI for the most dominant latent motifs of each network.

Figure 3 .
Figure3.Self-reconstruction and cross-reconstruction accuracy between several real-world and synthetic networks versus the edge threshold value θ and the number r of latent motifs in a network dictionary.The label X ← Y indicates that we reconstruct network X using a network dictionary that we learn from network Y .The reconstruction process produces a weighted network that we turn into an unweighted network by thresholding the edge weights at a threshold value θ, such that we keep only edges whose weight is strictly larger than θ.We measure reconstruction accuracy by calculating the Jaccard index of the original network's edge set and the reconstructed network's edge set.In panel (a), we plot accuracies versus θ (keeping the number of latent motifs fixed at r = 25), where X is one of five real-world networks (two PPI networks, two Facebook networks, and one collaboration network).In panels (b) and (c), we reconstruct each of the four Facebook networks using network dictionaries with r ∈ {9, 16, 25, 36, 64, 81, 100} latent motifs that we learn from one of ten networks (with the edge threshold value fixed at θ = 0.5).
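As a hedged illustration of this accuracy measure (assuming dense numpy adjacency arrays, and not tied to the authors' code), one can compute the thresholded Jaccard index as follows.

```python
# Jaccard-index reconstruction accuracy: threshold the weighted reconstruction
# at theta and compare edge sets with the original network.
import numpy as np

def jaccard_accuracy(A, A_recons, theta=0.5):
    iu = np.triu_indices_from(A, k=1)        # undirected: count each pair once
    orig = A[iu] > 0                         # original edges
    recon = A_recons[iu] > theta             # keep edges with weight > theta
    intersection = np.logical_and(orig, recon).sum()
    union = np.logical_or(orig, recon).sum()
    return intersection / union
```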

Figure 4 .
Figure 4. Application of the NDL and NDR algorithms to network denoising with additive and subtractive noise on a variety of networks from empirical data sets.In panels (a)-(e), we plot our results.In panel (f ), we compare our denoising results for −50% noise on SNAP FB, H. sapiens, and arXiv versus those of other methods.In our experiments with subtractive noise, we corrupt a network by removing a uniformly random subset of 10% or 50% of its edges, and we seek to classify the removed edges and the non-edges as true edges and false edges, respectively.In our experiments with additive noise, we corrupt a network by uniformly randomly adding 10% or 50% of the number of its edges, and we seek to classify the edges and non-edges in the resulting corrupted network as 'negative' (i.e., false edges) or 'positive' (i.e., true edges).To perform classification in a network, we first use NDL to learn latent motifs from a corrupted network and then reconstruct the networks using NDR to assign a confidence value to each potential edge.We then use these confidence values to infer membership of potential edges in the uncorrupted network.Importantly, we never use information from the original networks.For the Caltech Facebook network in panel (b), we also perform edge inference and denoising using a network dictionary that we learn from the MIT Facebook network.For each network, we indicate the receiver-operating characteristic (ROC) curves and corresponding area-under-the-curve (AUC) scores for network denoising with additive noise using the labels +10% and +50%, and we indicate the ROC curves and corresponding AUC scores for network denoising with subtractive noise using the labels −10% and −50%.
A homomorphism x : F → G maps the edges of F to some edges of G, so it maps F onto a subgraph of G; it thereby maps F 'into' G. For each homomorphism x : F → G from a motif F = ([k], A_F) into a network G = (V, A), we define the k × k matrix A_x by A_x(a, b) := A(x(a), x(b)) Φ_{F,x}(a, b) for all a, b ∈ {1, . . ., k}; this is the mesoscale patch of G that is induced by x (see (2)).

Figure 5 .
Figure 5. Illustration of our network dictionary learning (NDL) algorithm (see Algorithm 1). (a) Homomorphisms x_t : F → G from a k-chain motif into a target network G evolve as a Markov chain to yield a sequence of k-chain subgraphs (the green edges) in G. (b) Each copy of the k-chain motif in G induces a k-node subgraph (i.e., the mesoscale patch A_{x_t} that we defined in (2)). (c) We form a sequence of matrices X_t of size k² × N, where the N columns of each X_t are vectorizations of the N consecutive k × k mesoscale patches in panel (b). (d) Using an online nonnegative matrix factorization (NMF) algorithm, we progressively learn the desired number of latent motifs as the data matrices X_t of mesoscale patches arrive.

Figure 6 .
Figure 6.The most dominant latent motifs that we learn from 14 networks (eight real-world networks and six synthetic networks, which are single instantiations from random-graph models) at five different scales (specifically, for k = 6, 11, 21, 51, 101).Using our NDL algorithm, we learn network dictionaries of r = 25 latent motifs of k nodes for each of the 14 networks.For each network at each scale, we show only the most-dominant latent motifs from each dictionary.We showed the associated second-most dominant latent motifs in Figure 2. Black squares indicate 1 entries and white squares indicate 0 entries.

Figure 7 .
Figure 7. Comparison of r = 25 latent motifs that we learn from Coronavirus PPI using the NDL algorithm (see Algorithm 1) in four different settings, which arise from the choices mask ∈ {Id, NF} and MCMC ∈ {PivotApprox, Glauber}. The choices mask = Id and mask = NF indicate that the NDL algorithm uses mesoscale patches (2) with the identity mask (3) and the no-folding mask (4), respectively. The choices MCMC = PivotApprox and MCMC = Glauber indicate that the NDL algorithm uses the approximate pivot chain (see Algorithm MP with AcceptProb = Approximate) and the Glauber chain (see Algorithm MG), respectively. The other parameter values are λ = 1, N = 100, and T = 100. We use black squares for 1 entries and white squares for 0 entries. The numbers underneath the latent motifs give their dominance scores.

Algorithm 2 .
Network Denoising and Reconstruction (NDR). Input: a network G = (V, A) and a network dictionary W ∈ R^{k²×r}_{≥0}. Parameters: F = ([k], A_F) (a motif), T ∈ N (the number of iterations), λ ≥ 0 (the coefficient of an L1-regularizer), and θ ∈ [0, 1] (an edge threshold). Options: denoising ∈ {T, F}, mask ∈ {Id, NF}, and MCMC ∈ {Pivot, PivotApprox, Glauber}. Requirement: There exists at least one homomorphism F → G. When denoising = T, the algorithm uses the following subroutine to zero out on-chain entries: given a k² × m matrix Y, let Ỹ be the k × k × m tensor that we obtain by reshaping each column of Y using Algorithm A5; let Ỹ_off be the k × k × m tensor with Ỹ_off(a, b, c) = Ỹ(a, b, c) 1(A_F(a, b) = 0) for all a, b ∈ {1, . . ., k} and c ∈ {1, . . ., m}; and let Y_off be the k² × m matrix that we obtain from Ỹ_off by vectorizing each of its slices Ỹ_off[:, :, c] (with c ∈ {1, . . ., m}) using Algorithm A4. The output is the matrix Y_off ∈ R^{k²×m}. There are two variants of the NDR algorithm; the variant is specified by the Boolean variable denoising. The NDR algorithm with denoising = F is identical to the network-reconstruction algorithm in [30, Algorithm 2], except for the thresholding step. The NDR algorithm with denoising = T is a new variant of NDR that we present in the present work for the purpose of network denoising.

F.2. Figures 8, 9, 10, 11, 12, 13, and 14. These figures give latent motifs of the networks that we described in Section F.1 using Algorithm 1 with various parameter choices. In all of these figures, we use a chain motif with the corresponding network F = ([k], A_F) for T = 100 iterations, N = 100 homomorphisms per iteration, an L1-regularizer with coefficient λ = 1, a mask of mask = Id, and an MCMC motif-sampling algorithm of MCMC = Pivot. We specify the number r of latent motifs and the scale k in the caption of each figure.

F.3. Figure 1. In this figure, we illustrate latent motifs that we learn from UCLA and Caltech, and we compare them to an image dictionary. We use the following parameters in Algorithm 1 to generate the results in Figures 2 and 6: a chain motif with the corresponding network F = ([k], A_F), a scale k = 21, T = 100 iterations, N = 100 homomorphisms per iteration, r = 25 latent motifs, an L1-regularizer with coefficient λ = 1, a mask of mask = Id, and an MCMC motif-sampling algorithm of MCMC = Pivot. The pivot chain in Algorithm MP uses AcceptProb = Approximate. The image dictionary for the artwork Cycle in Figure 1 uses an algorithm that is similar to Algorithm 1, except that we uniformly randomly sample 21 × 21 square patches of the image instead of k × k mesoscale patches of a network.

F.4. Figure 7. This figure compares r = 25 latent motifs at scale k = 21 that we learn from Coronavirus PPI using Algorithm 1 with MCMC motif-sampling algorithms MCMC ∈ {Glauber, PivotApprox} and masks mask ∈ {Id, NF}. We specify the parameters in Algorithm 1 that we use to generate the results in Figure 7 in the caption of that figure.

F.5. Figure 3. To generate Figure 3, we first apply the NDL algorithm (see Algorithm 1) to each network that we consider in the figure to learn r = 25 latent motifs using a chain motif with the corresponding network F = ([k], A_F), a scale k = 21, T = 100 iterations, N = 100 homomorphisms per iteration, an L1-regularizer with coefficient λ = 1, a mask of mask = NF, and MCMC = PivotApprox. For each self-reconstruction X ← X (see the caption of Figure 3), we apply the NDR algorithm (see Algorithm 2) with a chain motif with the corresponding network F = ([k], A_F), a scale k = 21, T = n ln n iterations (where n is the number of nodes in the network), N = 100 homomorphisms per iteration, r = 25 latent motifs, an L1-regularizer with coefficient λ = 0 (i.e., no regularization), a mask of mask = Id, an MCMC motif-sampling algorithm of MCMC = PivotApprox, and denoising = F. For each cross-reconstruction Y ← X (see the caption of Figure 3), we apply the NDR algorithm (see Algorithm 2) with a chain motif with the corresponding network F = ([k], A_F), a scale k = 21, T = n ln n time steps (where n is the number of nodes in the network), N = 100 homomorphisms, an edge-threshold value of θ = 0.5, an L1-regularizer with coefficient λ = 0 (i.e., no regularization), a mask of mask = NF, an MCMC motif-sampling algorithm of MCMC = PivotApprox, and denoising = F. We use multiple different choices of the number r of latent motifs; we indicate them in the caption of Figure 3.

F.6. Figure 4. To generate Figure 4, we first apply the NDL algorithm (see Algorithm 1) to each corrupted network that we consider in the figure to learn r = 25 latent motifs using a chain motif with the corresponding network F = ([k], A_F), a scale k = 21, T = 400 iterations, N = 1000 homomorphisms per iteration, an L1-regularizer with coefficient λ = 1, a mask of mask = NF, and an MCMC motif-sampling algorithm of MCMC = PivotApprox. The NDR algorithm (see Algorithm 2) that we use to generate the results in Figure 4 uses r = 25 latent motifs, a chain motif with the corresponding network F = ([k], A_F), a scale k = 21, T = 400,000 iterations for H. sapiens and T = 200,000 iterations for all other networks, an L1-regularizer with coefficient λ = 1, a mask of mask = NF, MCMC = PivotApprox, and denoising = T. For Figure 4, we did not conduct the denoising experiment for Coronavirus PPI with −50% noise because the resulting network (with 1,536 nodes and 1,232 edges) cannot be connected. (To be connected, its spanning trees would need to have 1,535 edges.)

Figure 11 .
Figure 11. The latent motifs for r ∈ {9, 16, 25, 36, 49} at scale k = 21 that we learn from the networks Caltech, MIT, and UCLA. The r = 25 column is identical to the k = 21 column in Figure 8. The numbers underneath the latent motifs give their dominance scores (see Section D.2). See Section F for details of the experiments.

Figure 12 .
Figure 12. The latent motifs for r ∈ {16, 25, 36, 49} at scale k = 21 that we learn from the networks Coronavirus PPI, SNAP Facebook, arXiv ASTRO-PH, and Homo sapiens PPI. The r = 25 column is identical to the k = 21 column in Figure 9. The numbers underneath the latent motifs give their dominance scores (see Section D.2). See Section F for details of the experiments.

Table 1 .
Area-under-the-curve (AUC) scores of the ROC curves for our network-denoising experiments using NDL (see Algorithm 1) and NDR (see Algorithm 2). As in Figure 4, we first use NDL to learn latent motifs from a corrupted network and then reconstruct the networks using NDR to assign a confidence value to each potential edge. We use the networks SNAP FB, H. sapiens, and arXiv with −50% and +50% noise on the edges. In the last four rows, mk stands for the choice of mask in NDR and dn stands for the choice of denoising in NDR. We use mask = NF for NDL in each of the last four rows. For both NDL and NDR, we use MCMC = PivotApprox. The last row is a duplicate from Figure 4, and we obtain the results in rows five-seven using the same parameter choices as in Figure 4. For the last four rows, except for the six instances in italics (i.e., for dn = F and −50% noise), we use the classification scheme (D.3). For these six instances, we compute the AUC of the 'flipped' ROC curve, as the algorithm technically takes on the opposite classification task. When dn = T, we do not flip the classification task. In the first four rows, we construct 128-dimensional vector representations of the networks using the indicated methods, and we then use them for edge classification.