Main

With the improvements in sequencing technology, techniques to investigate biological samples have become increasingly refined, progressing from sequencing in bulk, to single-cell RNA1 and spatial transcriptomics2,3. The latter technique allows one to obtain the organization of cells in tissue, leading to deeper insights in biology and improving the detection of diseases4,5. However, current techniques for spatial transcriptomics rely on fluorescence microscopy, which is limited in throughput, especially for large amounts of targets3.

DNA microscopy is an emerging spatial transcriptomic technique that aims to find the spatial organization of DNA or RNA using sequencing alone, bypassing the use of optical microscopy. The common theme in all DNA microscopy methods is to use a polymerase chain reaction (PCR) and sequencing to find pairs of adjacent molecules and use that pairing information to find their relative locations6,7,8,9,10 (Fig. 1). In a typical workflow (Fig. 1a), molecules are barcoded and amplified locally in the tissue of interest, creating polymerase colonies (polonies)11 (Fig. 1b). Where polonies overlap, amplicons fuse by design and form concatemers (Fig. 1c), which, when sequenced, reveal which two polonies are adjacent. All the adjacency data are represented in a graph, where each node represents one polony, and each edge a sequenced concatemer. From this graph, the original locations can be estimated6,7,9,10 (Fig. 1d). Importantly, the information of these adjacency pairs is obtained with sequencing information only, meaning that (1) the method can capture both the sequence and location of transcripts simultaneously, and many targets can be captured simultaneously, (2) it does not require processing or stitching of image data, only analysis of sequencing data, and (3) it is not inherently limited to two-dimensional (2D) reconstructions, but can be used to reconstruct 3D samples as well.

Fig. 1: DNA microscopy reveals spatial locations by finding adjacent pairs of transcripts.

a, RNA is present in a biological sample of interest. b, RNA molecules are barcoded and amplified locally, forming polymerase colonies (polonies). c, Where polonies overlap, their amplicons can be engineered to fuse together, forming concatemers. d, Sequencing concatemers reveals the adjacency of all polonies, which can be displayed in a graph. The number of concatemers formed between two polonies is the edge weight. From this graph, the relative spatial coordinates of each molecule can then be obtained.

However, experimental conditions can give rise to erroneous signals that create artifacts in the adjacency graph and disrupt the spatial reconstruction. We consider two such errors. The first of these is that of spurious crosslinks, formed between any pair of nodes regardless of position. These can be formed by incomplete PCR during post-experimental library preparation when products are no longer spatially confined, in a reaction similar to barcode-swapping and index-hopping events12. The second type of error is a fused node. When two polonies contain the same barcode or very similar barcodes that are mistakenly fused by sequencing error correction, they are represented in the adjacency graph as a single node, which can lead to distortions in the reconstruction.

In this Article we propose two methods to remove these errors, collectively called MinIPath (‘Minimum Indirect Path’ analysis). These methods are based on graph analysis on the adjacency graph alone. Spurious crosslinks are detected by finding a short indirect path connecting two directly connected nodes, which we show is harder to find if two nodes are far away and erroneously connected. Fused nodes are detected by looking at the connected nodes of any node. When these can easily be separated into two indirectly connected groups, the node is probably a fused node and can be split. We show the effect of both types of error on the reconstruction quality using simulated diffusion-based data as input, and that these can be corrected by our method. In addition, we analyze a previously described DNA microscopy dataset7 and show that we can obtain accurate reconstructions by removing spurious crosslinks more efficiently than with a read count filter. In summary, this method provides an efficient way to filter artifacts from adjacency-based data, which can improve the overall quality of the resulting spatial reconstruction.

Results

Correcting errors in simulated data using graph organization

Although the principle of imaging space by sequencing can be realized in various experimental set-ups, we focus here on a set-up introduced previously7 that has yielded experimental results. In this arrangement, the sample is encapsulated in a hydrogel, meaning that polony formation is governed by diffusion. Two types of seed strand are present in the sample, to prevent the self-interactions of polonies, which could consume large amounts of sequencing data without providing information on neighbors. Furthermore, each formed concatemer contains a unique barcode, allowing one to count the number of interactions between two polonies. All the neighborhood interactions are represented in a neighborhood graph, where the weight of the edges equals the number of observed products between two polonies. As two types of seed strand are used, the neighborhood graph is a weighted, undirected, bipartite graph.

We first sought to simulate this experimental set-up. Starting from Fick’s law for diffusion for a single polony, one can derive the relationship between the reaction rate ω between two polonies and their distance7:

$${\omega \left(i,\,j\right)\propto {t}^{-\frac{d}{2}}\times \exp \left(\frac{-{\left|{{\bf{x}}_{i}}-{{\bf{x}}_{j}}\right|}^{2}}{{L}_{\rm{diff}}^{2}}\right)}$$
(1)
$${L}_{\rm{diff}}={\sqrt{8{dDt}}}$$
(2)

where ω(i, j) is the reaction rate between two polonies i and j, each of a different type, D is the diffusion constant, t is the time since polony creation, d is the number of dimensions, and xi and xj are the locations of polonies i and j, respectively. Ldiff describes the characteristic diffusion length of the distribution.

As polony input, we randomly distributed nodes of two types within a 2D shape (Fig. 2a; total dimensions 200 × 450 pixels; shape area 40,000 pixels; 4,000 nodes). The diffusion model was used to estimate the number of reactions between each pair of nodes, and this number was used as a parameter in the Poisson distribution to obtain connections and their edge weights:

$${w\left(i,\,j\right)}={{\rm{Poiss}}\left(a\times \exp \left(\frac{-{\left|{{\bf{x}}_{i}}-{{\bf{x}}_{j}}\right|}^{2}}{{\sigma }^{2}}\right)\right)}$$
(3)

where w(i, j) represents the edge weight between two nodes i and j, a represents the amplitude and σ the spread. In experimental terms, the amplitude can be affected by the inherent reactivity of the polonies and sequencing depth, both of which determine how many products are seen. If the amplitude is high, multiple concatemers are formed and/or sequenced for each polony pair, resulting in larger edge weights in the resulting graph. The spread is analogous to Ldiff and determines the distance at which two polonies might still react. Experimentally, it will be determined by the diffusion, and therefore by the properties of the hydrogel and size of the products.
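The sampling step in equation (3) can be sketched in a few lines of NumPy. This is a minimal illustration, not the simulation code used for the figures: the function name `simulate_edge_weights`, the node counts and the uniform placement of nodes on the full grid (rather than inside the 2D shape of Fig. 2a) are assumptions made for the example.

```python
import numpy as np

def simulate_edge_weights(pos_u, pos_v, amplitude, spread, rng):
    """Draw Poisson edge weights between two node types, following equation (3)."""
    # Squared Euclidean distance between every node of type U and every node of type V.
    diff = pos_u[:, None, :] - pos_v[None, :, :]
    sq_dist = np.sum(diff**2, axis=-1)
    # Expected number of observed concatemers for each pair.
    rate = amplitude * np.exp(-sq_dist / spread**2)
    return rng.poisson(rate)

rng = np.random.default_rng(0)
# Hypothetical toy input: uniform positions on a 200 x 450 pixel grid.
pos_u = rng.uniform([0, 0], [200, 450], size=(50, 2))
pos_v = rng.uniform([0, 0], [200, 450], size=(60, 2))
weights = simulate_edge_weights(pos_u, pos_v, amplitude=10.0, spread=50.0, rng=rng)
```

Rows of `weights` index one node type and columns the other, so the array is the biadjacency matrix of the bipartite graph; a zero entry means that no edge was observed.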

Fig. 2: Reconstruction quality depends on Gaussian parameters that determine adjacency.

a, Node locations used as input, on a grid of 200 × 450 pixels. b, Edges are formed between neighboring nodes. The spread determines the range at which edges are formed, and the amplitude determines the edge weight. c, The reconstruction quality depends on the amplitude and spread. d–g, Example reconstructions with different spreads and amplitudes: good local, poor global quality (d), good local, good global quality (e), poor local, good global quality (f) and poor local, poor global quality (g).

As a baseline, we generated adjacency data using a wide range of amplitudes and spreads (Fig. 2b) and reconstructed the polony locations using the previously described spectral maximum likelihood embedding (sMLE) method7. We reconstructed all adjacency datasets where at least 80% of all nodes were connected in a single group, and evaluated these reconstructions using two metrics: the Procrustes disparity as a global metric, and the number of overlapping neighbors out of the 15 nearest as a local metric9.
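The two metrics can be sketched as follows. This is an illustrative version only: `scipy.spatial.procrustes` is used for the global disparity, and the function names `knn_overlap` and `evaluate` are hypothetical, as is the brute-force neighbor search, which is adequate for small examples but not for large graphs.

```python
import numpy as np
from scipy.spatial import procrustes

def knn_overlap(original, reconstruction, k=15):
    """Mean fraction of the k nearest neighbors shared between the two layouts."""
    def neighbors(points):
        dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
        np.fill_diagonal(dist, np.inf)          # a node is not its own neighbor
        return np.argsort(dist, axis=1)[:, :k]
    shared = [len(set(a) & set(b))
              for a, b in zip(neighbors(original), neighbors(reconstruction))]
    return np.mean(shared) / k

def evaluate(original, reconstruction, k=15):
    """Return (global Procrustes disparity, local k-nearest-neighbor overlap)."""
    _, _, disparity = procrustes(original, reconstruction)
    return disparity, knn_overlap(original, reconstruction, k)

rng = np.random.default_rng(1)
original = rng.uniform(0, 100, size=(40, 2))
# A scaled and mirrored copy stands in for a perfect reconstruction.
reconstruction = 2.0 * original[:, ::-1]
disparity, local = evaluate(original, reconstruction)
```

Because the Procrustes analysis removes translation, scale, rotation and reflection, the mirrored copy scores a disparity close to zero, and every node keeps all 15 of its nearest neighbors.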

The different parameters used in the simulation had a great influence on the reconstruction quality (Fig. 2c). Globally accurate reconstructions were obtained for many of the simulations. Only when a very low spread (σ = 10) or a combination of high spread and amplitude was used (σ = 200, a = 100) did the reconstructions become less accurate based on the global metric. Local accuracy depended primarily on the spread: if it was small (σ ≤ 50), local accuracy was high even when global accuracy was low (k-nearest neighbors (KNN) ≥ 0.75; Fig. 2d), but when it started to approach the sample size, local accuracy decreased (Fig. 2f,g). Higher amplitudes also led to better local reconstructions, and the combination of low spread and high amplitude led to the most accurate reconstructions both globally and locally (Fig. 2e).

We then developed algorithms to detect and correct spurious crosslinks and fused nodes (Fig. 3). As mentioned above, spurious crosslinks randomly connect any two nodes of different types, regardless of the position of their corresponding polonies in the original sample (Fig. 3a). Similarly, node fusion results in nodes that inherit the connections of two randomly selected nodes, again regardless of the position of their corresponding polonies in the sample (Fig. 3b).
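Both error types are simple to introduce into a simulated biadjacency matrix. The sketch below is an assumption-laden illustration rather than the procedure used for the figures: spurious crosslinks are drawn uniformly at random with unit weight, the number added is a fraction of the total edge weight, and fusion is shown for two nodes of the row type only. The names `add_spurious_crosslinks` and `fuse_nodes` are hypothetical.

```python
import numpy as np

def add_spurious_crosslinks(W, fraction, rng):
    """Add unit-weight edges between uniformly random pairs of nodes of different types."""
    W = W.copy()
    n_extra = int(round(fraction * W.sum()))
    us = rng.integers(0, W.shape[0], size=n_extra)
    vs = rng.integers(0, W.shape[1], size=n_extra)
    np.add.at(W, (us, vs), 1)      # repeated pairs accumulate weight
    return W

def fuse_nodes(W, i, j):
    """Merge row nodes i and j: node i inherits the edges of both and node j is dropped."""
    W = W.copy()
    W[i] += W[j]
    return np.delete(W, j, axis=0)

rng = np.random.default_rng(2)
W = np.array([[2, 0, 0],
              [0, 3, 0],
              [0, 0, 5]])
noisy = add_spurious_crosslinks(W, 0.2, rng)   # adds 2 units of edge weight
fused = fuse_nodes(W, 0, 2)                    # rows 0 and 2 become one node
```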

Fig. 3: Spurious crosslinks and fused nodes can be detected by indirect path analysis.

a, Spurious crosslinks, which connect two nodes of different types, regardless of distance. b, Fused nodes, in which two nodes are fused into one node, which inherits both of their edges. c, Spurious crosslink removal: obtain indirect paths of length three. For normal, local connections, one can find many short indirect paths connecting the same two nodes. However, when two nodes are spuriously connected and are sufficiently far apart, fewer short indirect paths will be found between them. d, Fused node splitting: separate connected nodes into two groups: normal nodes and fused nodes. A normal node is connected to several other nodes, which form a well-connected subgraph using short, indirect paths. Partitioning this subgraph removes many edges. However, the nodes connected to a fused node form two well-connected subgraphs, with only a few connections between them. Partitioning this graph therefore requires the removal of fewer edges.

For spurious crosslinks, we reasoned that polonies are usually surrounded by many other polonies to which a connection is possible. This implies that when two nodes are connected in a graph, it should be straightforward to find a short, indirect path in the graph to connect these two nodes, without using the edge itself (Fig. 3c, left). By contrast, spurious crosslinks are formed between polonies regardless of their distance. The further they are apart, the less likely it will be that a short, indirect path between the two nodes can be found (Fig. 3c, right). Using edge weights further amplifies the difference, as longer distance between nodes results in lower edge weights (equation (3)). To exploit this difference, for each edge, we find all indirect paths of length three (the minimum in a bipartite graph), calculate the product of the edge weights of each path and sum these together, to calculate what we will refer to as the indirect path value of that edge. We then use this to distinguish between normally connected nodes and spuriously connected nodes, removing the edge if it is below a certain cutoff.

For fused nodes, we built on the same reasoning that polonies are typically surrounded by other polonies to which a connection is possible. The nodes connected to any single, unaltered node should therefore also be connected to each other, with short, indirect paths, and form a single well-connected subgraph (using indirect paths of length two, the minimum in a bipartite graph; Fig. 3d, top). For the connected nodes of fused nodes, however, this is not necessarily the case. If the two polonies that are represented by a single fused node were sufficiently far apart in the sample, the connected nodes of the fused node should form two well-connected subgraphs, with many connections within each subgraph, but few connections between them (Fig. 3d, bottom). The original groups should therefore be obtainable with spectral graph partitioning13, an algorithm that seeks to obtain graph partitions by minimizing the number of edges removed between them, while maximizing the number of edges within each partition. The ratio between these is called the normalized cut (ncut)14. The connected nodes of unaltered nodes can of course also be partitioned in two, but this requires the removal of many more edges, resulting in a higher normalized cut. The normalized cut can therefore serve to distinguish between fused and unaltered nodes. When the normalized cut of a node is below a given cutoff, the node is replaced by two nodes, each inheriting the connections to the nodes in one of the two graph partitions (Methods section Algorithm description).

To examine whether these algorithms proved effective, we first either added spurious crosslinks (1–20% of total edge weights) or fused nodes (1–20% of all nodes) in the simulated data. Adding in these errors distorted the reconstructions (for example, Fig. 4a). The global reconstruction quality was affected more than the local reconstruction quality, which can be understood as a small number of errors that twist the reconstruction without affecting the nearby neighbors of each point. We found that using a large spread when simulating connections made the resulting reconstructions more robust to errors, while the amplitude used in the simulation had little effect (Extended Data Fig. 1). When using a large spread, an added spurious crosslink is more likely to connect two nodes that are already connected, and a fusion is more likely to merge two nodes whose connected nodes broadly overlap, explaining the increased robustness to errors. Furthermore, because fusing nodes introduces nodes with twice as many connections, we verified that the resulting long-tailed distribution of node connectivity itself was appropriately normalized by the reconstruction pipeline and did not influence the reconstruction. To this end, we randomly introduced an increased reactivity bias to randomly selected nodes (1–20%) and found that this did not affect the reconstruction quality (Extended Data Fig. 1).

Fig. 4: Indirect path analysis rescues reconstructions by removing spurious crosslinks and correcting fused nodes.

a,d, Example distorted reconstructions at amplitude 10, and width 50 (a) and 20 (d), with 20% errors. b,e, Indirect paths and normalized cuts distributions of the pairing data used to generate the reconstructions in a (b) and d (e). The arrow indicates the cutoff used. c,f, Reconstructions after correction, corresponding to a and d. g–j, Average fraction of true positive (g,i) and false positive (h,j) across all simulated datasets. k–n, Average local (k,m) and global (l,n) reconstruction qualities before and after corrections across simulated datasets that were affected by the errors (with amplitude of 1, 10 or 100 and width of 20, 50 or 100).

We applied our algorithm to the simulated data to calculate all indirect path values and normalized cuts, obtaining distributions of each for each simulated case (for example, Fig. 4b,e). On average, the spurious crosslinks connected nodes at larger distances from each other than normal connections, although there was some overlap (154.43 ± 90.43 for spurious crosslinks compared to 111.36 ± 74.91 for normal edges; average taken across all simulations). Still, the indirect path values were lower for spurious crosslinks compared to normal connections, with the exact values depending on the amplitude and spread used in the simulation, as well as the number of other errors (Extended Data Fig. 2). Similarly, the normalized cut values were lower for fused nodes (0.47 ± 0.33 for fused nodes compared to 0.71 ± 0.17 for normal nodes; average taken across all simulations). Again, distance played an important role here to detect each of the errors, as the crosslinks formed at longer distances had lower indirect path values, and fused nodes whose constituents were originally at a large distance could be partitioned with lower normalized cuts (Extended Data Fig. 2).

We then applied a range of cutoffs by taking the lower quantiles of all indirect path values or normalized cuts. Edges below their cutoffs were removed, and nodes with normalized cuts below their cutoff were split. Although not all errors could be removed, the correction algorithms preferentially corrected the spurious crosslinks and fused nodes over original edges and unaltered nodes on average across all simulations (Fig. 4g–i). The ratio of correctly identified errors (true positive) to original data identified as errors (false positive) depended primarily on the spread, that is, errors were more easily identified when only local reactions were formed (Extended Data Fig. 3). Notably, these reconstructions were also the ones that were most affected by the introduced errors in the first place (Extended Data Fig. 1). In addition, applying the algorithm without the presence of errors did not affect the reconstructions, except when an exceptionally high cutoff (quantile of 0.50) was used, suggesting that the error correction algorithm can safely be used on unaltered data, even at slightly too high cutoffs, without affecting reconstruction quality.
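The quantile-based cutoff can be illustrated with a short sketch. It assumes that each edge already has an indirect path value and that edges strictly below the quantile are dropped; the function name `filter_by_quantile` and the toy values are hypothetical.

```python
import numpy as np

def filter_by_quantile(edges, values, quantile):
    """Keep the edges whose value is at or above the given lower quantile."""
    values = np.asarray(values, dtype=float)
    cutoff = np.quantile(values, quantile)
    kept = [edge for edge, value in zip(edges, values) if value >= cutoff]
    return kept, cutoff

# Toy input: ten edges with indirect path values 0, 1, ..., 9.
edges = [(u, u) for u in range(10)]
values = np.arange(10)
kept, cutoff = filter_by_quantile(edges, values, 0.2)
```

With these values the 0.2 quantile falls between 1 and 2, so the two edges with the lowest indirect path values are removed. The same selection applies to normalized cuts, with nodes below the cutoff being split rather than removed.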

Applying the correction algorithm improved the reconstruction quality of any reconstruction that was affected by the errors (Extended Data Figs. 4–7). Zooming in on simulations that were initially accurate but strongly affected by the errors (20 ≤ spread ≤ 100, amplitude ≥ 1.0), the average Procrustes disparity increased from 0.02 ± 0.03 to 0.63 ± 0.10 and to 0.52 ± 0.12, when spurious crosslinks or fused nodes were added, respectively. However, the average quality improved by applying the correction algorithm (Fig. 4k–n), and for each case where the introduced errors affected the reconstruction, a cutoff could be found that restored it (Extended Data Figs. 4–7). For most cases, this quantile cutoff was equal to the fraction of introduced errors, although when a higher cutoff was used, the reconstruction quality did not decrease. Using the expected fraction of errors in the data as a quantile cutoff therefore seems an appropriate guideline.

When fused nodes were corrected, we also evaluated the node splitting accuracy, that is, whether the nodes after splitting were connected to the same nodes as before the nodes were fused (Methods, Simulated data processing). We found that the resulting groups matched accurately to the original (average Jaccard index: 0.73 ± 0.17), and most accurately when the spread in the simulation was low (Extended Data Fig. 8). Node splitting only proved ineffective when spread and amplitude were low. In these cases, nodes were typically connected to only a few other nodes, which themselves were not indirectly connected to each other. The resulting indirectly connected graph was therefore often disconnected into multiple components, similarly to when a node is fused, making it impossible to identify whether the graph is easily partitioned due to sparse data or due to a fused node (Extended Data Fig. 8).

In summary, the proposed method removes spurious crosslinks and corrects fused nodes across a wide range of simulated data, even in the presence of up to 20% spurious crosslinks or 20% fused nodes.

Correcting disruptive crosslinks in experimental data

To see how our method would perform on experimental data, we analyzed a previously published DNA microscopy dataset for which a reference image was available7. In this experimental set-up, specific types of RNA transcript (ACTB for ‘beacons’, and gfp, rfp and gapdh as ‘targets’) were used as seeds for the two types of polony in the bipartite graph. Each product connecting two polonies could be recognized by a unique barcode called a unique event identifier (UEI), the number of which was used as the edge weight between the respective nodes.

Using the same pipeline as described in the previous work7, we extracted 1.26 × 10⁵ polonies with 6.72 × 10⁵ edges and 9.55 × 10⁵ unique UEIs from the raw sequencing data. When using all data to reconstruct the largest connected group, the resulting reconstructions frequently collapsed into a ‘star-like’ pattern (Extended Data Fig. 9). Only one out of ten reconstructions produced a layout that could be overlaid on the microscopy image with a poor match (Fig. 5b). To remove possible artifacts, Weinstein et al.7 applied a read count filter that removed all products without sufficient reads. Although this strategy did improve the reconstruction quality, a read count filter of four was required to obtain an accurate reconstruction (Fig. 5c,e). Given the large number of products produced during a DNA microscopy reaction, many of them had low read counts (Extended Data Fig. 10), and, as a result, only 70.6% (6.39 × 10⁵/9.03 × 10⁵) of all UEIs remained for reconstruction.

Fig. 5: Applying a minimum indirect path filter efficiently removes topological artifacts from experimental data.

a–f, Data loss is calculated as the total number of UEIs remaining for reconstruction after the applied filter. White scale bars: 100 µm. Striped scale bars: 1 Ldiff. a, Reference microscopy image. b, Reconstruction created from uncorrected and unfiltered data. c,e, Reconstructions after read count filters of 3 (c) and 4 (e). d,f, Reconstructions after indirect path filters of 1 (d) and 2 (f). g, Indirect path distribution of length 3. h,i, Number of remaining UEIs, edges, and beacon and target polonies after the read count filter (h) and the indirect path filter (i). Figure adapted with permission from: a–f, adapted from ref. 7, Elsevier.

By contrast, applying a minimum indirect path cutoff of 1 greatly improved the reconstruction quality, while only removing 3.6% of all UEIs (3.2 × 10⁴/9.03 × 10⁵; Fig. 5d). The reconstruction quality further improved with an indirect path cutoff of 2 (removing 4.8% of all UEIs; Fig. 5f). Applying an even higher cutoff did not clearly further improve the reconstruction. Using an indirect path cutoff therefore removed disruptive edges more efficiently than a read count filter, allowing more data to be used for the resulting reconstruction.

We also attempted to split possible fused nodes in this dataset. The subgraphs formed from the nodes connected to any particular node often formed more than two connected components (8.9 × 10⁴/1.1 × 10⁵; 82.4%), similar to the simulated datasets with low amplitude and spread, meaning it could not be used for accurate partitioning. Indeed, when describing the experimental data in terms of amplitude and spread, we found it had low amplitude and average spread. The dimensions of the sample of an accurate reconstruction (Fig. 5f) were 9 × 8 Ldiff2 equivalents, that is, 11–13 times as long as the average distance of two connected nodes (0.68 ± 0.59 Ldiff2 equivalents), suggesting an average to low spread. Of all 1.94 × 10⁷ polony pairs that were within the average pairing distance, only 6.72 × 10⁵ edges (3.5%) were obtained from the sequencing data, similar to a low amplitude in the simulation. Possible fused nodes could therefore not be identified.

Overall, applying a minimum indirect path filter efficiently removed disruptive crosslinks from the experimental data, losing only a small percentage of all connections. Because the fidelity of the reconstruction scales with the amount of available data, the proposed algorithm could provide a useful filter to obtain more accurate reconstructions from sparse data.

Discussion

We note a few observations, shortcomings and directions for future investigations. First, MinIPath did not exclusively or completely remove spurious crosslinks from simulated datasets, as the indirect path values for both partially overlapped, regardless of spread or amplitude. An adaptation of the algorithm taking longer indirect paths of five or seven steps may provide a better distinction between the erroneous and non-erroneous edges. The same principle could be applied to improve the performance of the node splitting algorithm on sparse data, to still allow one to find groups of indirectly connected nodes when fewer connections are present. However, such adaptations do come at an increased computational cost. Calculating the indirect paths of length k has an expected runtime performance of \({O\left({d}^{k}\times {|E|}\right)}\), where d is the average degree, and |E| is the number of edges in the connection graph. Using longer indirect paths in the fused node algorithm will also result in more connections between the two groups of connected nodes, which might increase the number of false negatives. Such an adaptation would therefore have to be carefully characterized.

Second, it is possible that for edges connecting nodes in sparse areas of the samples (such as at the edge), lower indirect path values are calculated. To correct for these inaccuracies, the indirect path values would have to be corrected and normalized using the degree of each node. Such adaptations may be required for samples with a larger variation in the node density across the sample.

Finally, we have applied our error correction algorithm here on DNA microscopy data, but we note that the same principle could be applied on any dataset where adjacency is the primary source of data, such as Hi-C data15. Several methods have been described to obtain the 3D organization from pairing data between genomic regions16, and it remains unclear, to our knowledge, how artifacts affect these. For this purpose, the method could be adapted to work on non-bipartite graphs. How errors affect these reconstructions and whether this correction algorithm can improve them remains a topic for future studies.

Methods

Algorithm description

As input, the algorithms take undirected, bipartite, weighted graphs, here called G:

$${G=(U,\,V,\,E,\,w)}$$
(4)
$${E\subseteq U{\rm{\times }}V}$$
(5)
$${w\left(x,\,y\right)=\left\{\begin{array}{c}d\in {\mathbb{N}}\\ 0\end{array}\right.\begin{array}{c}{\rm{if}}\,\left(x,\,y\right)\in E\\ {\rm{if}}\,\left(x,\,y\right)\notin E\end{array}}$$
(6)

U and V are the independent sets of nodes defining the bipartition of the graph, E is the set of edges, and w is a function assigning an edge weight d to each connected pair of nodes (d is not necessarily a constant, but varies per node pair).

Spurious crosslinks are identified by counting the number of short indirect paths between two nodes connected by an edge. We calculate the indirect path value at three steps (wi3) by taking the product of the edge weights of each of the three edges that form one indirect path between the two nodes x and y, then adding these products together for all indirect paths connecting the nodes:

$${{w}_{i3}\left(x,\,y\right)=\sum _{\left(a,\,b\right){\rm{|}}(a\in V;\,a\ne y;\,b\in U;\,b\ne x)}w\left(x,\,a\right)\times w(a,\,b)\times w(b,\,y)}$$
(7)
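Equation (7) can be evaluated directly for a small graph. The brute-force sketch below is for illustration only; the function name `indirect_path_value` and the toy weight matrix are assumptions, with rows standing for nodes in U and columns for nodes in V.

```python
def indirect_path_value(W, x, y):
    """Sum, over all paths x - a - b - y, of the product of the three edge weights."""
    total = 0
    for a in range(len(W[0])):      # a in V, a != y
        if a == y:
            continue
        for b in range(len(W)):     # b in U, b != x
            if b == x:
                continue
            total += W[x][a] * W[b][a] * W[b][y]
    return total

# Toy graph: U0 and U1 share the neighbors V0 and V1; U2 connects only to V2.
# The edge U0-V2 plays the role of a spurious crosslink.
W = [[1, 2, 1],
     [3, 1, 0],
     [0, 0, 4]]
local_value = indirect_path_value(W, 0, 0)      # edge inside the local cluster
spurious_value = indirect_path_value(W, 0, 2)   # edge bridging the two clusters
```

The local edge is supported by one indirect path (U0, V1, U1, V0) with weight product 2 × 1 × 3 = 6, whereas the bridging edge has no indirect path at all and receives a value of 0.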

For fused node correction, let Sa denote the immediate neighbors of some node a. Although these nodes do not have edges among themselves (because they belong to the same bipartition), they can be indirectly connected at two steps. For convenience, we form the graph Ga from the nodes Sa, the edges Ea and the edge weights wi2:

$${G}_{a}=({S}_{a},\,{E}_{a},\,{w}_{i2})$$
(8)

where the edges \({E}_{a}\subseteq {S}_{a}^{2}\) are formed between the nodes in Sa, and the edge weights are given for each \({\left(x,\,y\right)\in {S}_{a}^{2}}\):

$${{w}_{i2}\left(x,\,y\right)=\left\{\begin{array}{l}\mathop{\sum}\limits _{b\in U;\,b\ne a}w(b,\,x)\times w\left(b,\,y\right)\,{\rm{if}}\,a\in U\\ \mathop{\sum}\limits_{b\in V;\,b\ne a}w(x,\,b)\times w(\,y,\,b)\,{\rm{if}}\,a\in V\end{array}\right.}$$
(9)

Naturally, if wi2(x, y) = 0, x and y do not have an edge. Note that, in contrast to G, Ga is not a bipartite graph. Also, the edges to the original node a are not used to find the edges in wi2; that is, indirect rather than direct paths are used.
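For a node a in V, the weights of equation (9) reduce to a matrix product over the rows of the neighbors of a, with the column of a itself left out. The sketch below assumes a dense biadjacency matrix with rows for U and columns for V; the function name `neighbor_graph` is hypothetical.

```python
import numpy as np

def neighbor_graph(W, a):
    """Return the neighbors S_a of column node a and their two-step weights w_i2."""
    W = np.asarray(W, dtype=float)
    s_a = np.flatnonzero(W[:, a])              # S_a: row nodes connected to a
    others = np.delete(W[s_a], a, axis=1)      # drop column a: paths through a are excluded
    w_i2 = others @ others.T                   # sum over b != a of w(x, b) * w(y, b)
    np.fill_diagonal(w_i2, 0.0)                # no self-loops in G_a
    return s_a, w_i2

W = [[1, 2, 1],
     [3, 1, 0],
     [0, 0, 4]]
s_a, w_i2 = neighbor_graph(W, 0)      # neighbors of V0 are U0 and U1
s_c, w_c = neighbor_graph(W, 2)       # neighbors of V2 are U0 and U2
```

U0 and U1 are linked in two steps through V1 (weight 2 × 1 = 2), whereas U0 and U2 share no neighbor other than V2, so the graph built around V2 has no edges.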

To obtain the partition of Ga, we first check whether Ga is naturally disconnected into multiple components. If it consists of exactly two disconnected components, those are used as the partitions of Ga, with a normalized cut value of 0.0. If more than two components are found, the node is marked as unevaluable (due to sparse data). Otherwise, we apply spectral graph partitioning13 to partition Ga into two components, Sa1 and Sa2. The partition is evaluated by calculating the normalized cut14. The cut for this specific partition is first calculated by taking the sum of the edge weights removed between sets Sa1 and Sa2 by the partitioning:

$${{\rm{cut}}\left({S}_{a1},\,{S}_{a2}\right)=\sum _{\left(u,\,v\right){\rm{|}}u\in {S}_{a1};\,v\in {S}_{a2}}{w}_{i2}(u,\,v)}$$
(10)

and the normalized cut is calculated by dividing the cut by the sum of the edge weights in each partition14:

$${\rm{ncut}}\left({S}_{a1},\,{S}_{a2}\right)=\frac{{\rm{cut}}({S}_{a1},\,{S}_{a2})}{\sum _{\left(u,\,t\right)|u\in {S}_{a1};\,t\in {S}_{a}}{w}_{i2}(u,\,t)}+\frac{{\rm{cut}}({S}_{a1},\,{S}_{a2})}{\sum _{\left(u,\,t\right)|u\in {S}_{a2};\,t\in {S}_{a}}{w}_{i2}(u,\,t)}$$
(11)

When the normalized cut is below the cutoff, node a is removed, and two new nodes are created in G that each inherit the edges of node a to either the nodes in Sa1 or Sa2.
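The partitioning and scoring steps can be sketched as below. This is not the implementation used here: the spectral step is replaced by the sign of the second eigenvector of the normalized Laplacian computed with NumPy, which is one common way to approximate a minimum normalized cut, and the function names are hypothetical. The input is assumed to be a symmetric matrix of w_i2 values with at least two nodes.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.csgraph import connected_components

def normalized_cut(w, labels):
    """Normalized cut of a two-way partition, following equations (10) and (11)."""
    part1, part2 = labels == 0, labels == 1
    cut = w[np.ix_(part1, part2)].sum()
    return cut / w[part1].sum() + cut / w[part2].sum()

def partition_neighbors(w):
    """Return (labels, ncut) for the graph G_a, or (None, None) if unevaluable."""
    n_comp, comp = connected_components(sparse.csr_matrix(w), directed=False)
    if n_comp > 2:
        return None, None                  # too sparse to evaluate
    if n_comp == 2:
        return comp, 0.0                   # naturally separated groups
    # Assumed stand-in for spectral partitioning: sign of the Fiedler vector.
    scale = 1.0 / np.sqrt(w.sum(axis=1))
    laplacian = np.eye(len(w)) - scale[:, None] * w * scale[None, :]
    _, vectors = np.linalg.eigh(laplacian)
    labels = (scale * vectors[:, 1] > 0).astype(int)
    return labels, normalized_cut(w, labels)

# Toy G_a of a fused node: two tight groups of three nodes joined by one weak edge.
w = np.zeros((6, 6))
w[:3, :3] = 5.0
w[3:, 3:] = 5.0
np.fill_diagonal(w, 0.0)
w[2, 3] = w[3, 2] = 1.0
labels, ncut = partition_neighbors(w)
```

In this example the cut removes a weight of 1, and each group has a total weight of 31, giving a normalized cut of 2/31, well below the values expected for a single well-connected group.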

Algorithm implementation

Two methods were implemented in Python to calculate weighted indirect paths. The first method starts from the asymmetric adjacency matrix \({A\left(i,\,j\right)=w(i,\,j)}\), calculates the three-step adjacency matrix \({{A}_{3}\left(i,\,j\right)=A\times {A}^{\rm{T}}\times A}\) and then, for each node pair with an edge, subtracts the paths that use their direct edge, while setting other node pairs to 0:

$${{A}_{i3}\left(i,\,j\right)=\left\{\begin{array}{ll}{A}_{3}\left(i,\,j\right)-w\left(i,\,j\right)\times \left(\sum _{k}{w\left(i,\,k\right)}^{2}+\sum _{k}{w\left(k,\,j\right)}^{2}-{w\left(i,\,j\right)}^{2}\right), & A(i,\,j) > 0\\ 0, & A(i,\,j)=0\end{array}\right.}$$
(12)
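The matrix route can be sketched in NumPy as follows. The sketch is illustrative and the function name `indirect_paths_dense` is hypothetical. The overlap term is taken as w(i, j)² here, which is what inclusion and exclusion over the two conditions a ≠ y and b ≠ x of equation (7) gives: paths with a = y and paths with b = x are each subtracted once, and the path that satisfies both, with product w(i, j)³, is added back once.

```python
import numpy as np

def indirect_paths_dense(A):
    """Indirect path values of length three for every existing edge of A."""
    A = np.asarray(A, dtype=float)
    A3 = A @ A.T @ A                                  # all paths of length three
    row_sq = (A**2).sum(axis=1, keepdims=True)        # sum_k w(i, k)^2
    col_sq = (A**2).sum(axis=0, keepdims=True)        # sum_k w(k, j)^2
    through_direct_edge = A * (row_sq + col_sq - A**2)
    return np.where(A > 0, A3 - through_direct_edge, 0.0)

A = [[1, 2, 1],
     [3, 1, 0],
     [0, 0, 4]]
A_i3 = indirect_paths_dense(A)
```

For this toy matrix the result is [[6, 3, 0], [2, 6, 0], [0, 0, 0]], which matches a direct enumeration of the paths in equation (7); the edge between the first row node and the third column node, which bridges two otherwise unconnected groups, receives a value of 0.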

Calculating A3 becomes computationally challenging for large datasets, partly because it calculates all paths of length three, not just those between nodes that were originally connected. We therefore implemented the calculation of only the indirect paths of length three between nodes connected in the original dataset, using sparse matrices. This method first calculates \({{A}_{2}\left(i,\,j\right)=A\times {A}^{\rm{T}}}\), which contains all two-step paths between all nodes of one partition (for example, U). It then iterates over every edge to find all indirect paths of length three.

Pseudocode

For node a in V:
  For every node b connected to a (that is, A(ib, ia) > 0):
    common nodes = set(nodes connected to b in two steps) ∩ set(nodes connected to a)
                   (that is, where(A2(ib, :) > 0) ∩ where(A(:, ia) > 0))
    For node c in common nodes, with c ≠ b:
      Ai3(ic, ia) += (A2(ib, ic) − A(ib, ia) * A(ic, ia)) * A(ib, ia)

Here ix denotes the index of node x. The second method is implemented in Python with Numba17 acceleration to allow for the use of multiple threads.
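A runnable version of this per-edge loop might look as follows. It is a sketch using `scipy.sparse` without Numba acceleration or multithreading, the function name `indirect_paths_per_edge` is hypothetical, and the case c = b is skipped explicitly because such a path would pass through the direct edge.

```python
from scipy import sparse

def indirect_paths_per_edge(A):
    """Indirect path value of length three for each edge (c, a) of a biadjacency matrix."""
    A = sparse.csr_matrix(A, dtype=float)     # rows: nodes in U, columns: nodes in V
    A2 = (A @ A.T).tocsr()                    # two-step paths between nodes in U
    A_csc = A.tocsc()
    result = {}
    for a in range(A.shape[1]):
        start, end = A_csc.indptr[a], A_csc.indptr[a + 1]
        neighbors = A_csc.indices[start:end]  # nodes in U connected to a
        weights = A_csc.data[start:end]
        for c, w_ca in zip(neighbors, weights):
            total = 0.0
            for b, w_ba in zip(neighbors, weights):
                if b == c:
                    continue                  # would reuse the direct edge (c, a)
                # Two-step paths from b to c that avoid a, extended by the edge (b, a).
                total += (A2[b, c] - w_ba * w_ca) * w_ba
            result[(int(c), a)] = total
    return result

A = [[1, 2, 1],
     [3, 1, 0],
     [0, 0, 4]]
values = indirect_paths_per_edge(A)
```

Every neighbor b of a shares at least the node a with c, so iterating over the neighbors of a already covers the set of common nodes. Only existing edges receive an entry, which keeps the memory footprint proportional to the number of edges.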

For node splitting, the graph Ga was extracted and partitioned using a spectral graph partitioning tool from the scikit-learn package18. Nodes were not considered for splitting if they had fewer than four connections, or if Ga consisted of more than two components. Normalized cuts were calculated first for all nodes, then nodes were selected for splitting according to the applied cutoff. If two nodes with an edge were both split, the edge was removed.

Simulated data processing

Polony locations were reconstructed using the sMLE method described previously7. A slightly adapted version of the pipeline was used to process large amounts of files more easily. In contrast to experimental data, simulated data were not subjected to the iterative minimum UEI filter before reconstruction. Analysis was done with custom scripts in Python v3.9.12, using the packages numba v.0.53.1 (ref. 17), numpy v1.22.4 (ref. 19), pandas v.1.4.4 (ref. 20), scikit-learn v1.1.3 (ref. 18) and scipy v1.9.3 (ref. 21), and visualized with seaborn v.0.11.2 (ref. 22). Reconstructions from graphs where the largest connected component was smaller than 80% of all nodes were not considered for further analysis. Global reconstruction quality was assessed using the Procrustes disparity, and local reconstruction quality was assessed by the overlap of the k-nearest neighbors for each node in the original and reconstruction positions, as suggested previously9.

To evaluate the node splitting accuracy, we paired each set of nodes Sa1 and Sa2 to their closest match among the two sets of nodes in Sb and Sc (i.e. the nodes connected to the original nodes b and c that made up the fused node), and calculated the overlap:

$${\rm{Overlap}}={\frac{\sum _{n\in \{1,\,2\}}\max \left(\frac{\left|{S}_{{an}}\cap {S}_{b}\right|}{\left|{S}_{{an}}\right|},\,\frac{\left|{S}_{{an}}\cap {S}_{c}\right|}{\left|{S}_{{an}}\right|}\right)}{2}}$$
(13)
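Equation (13) translates into a few lines of Python when the node groups are held as sets. The sketch reads the fractions as ratios of set sizes, and the function name `splitting_overlap` and the toy sets are hypothetical.

```python
def splitting_overlap(s_a1, s_a2, s_b, s_c):
    """Average, over both partitions, of the best-matching fraction of recovered neighbors."""
    def best_match(part):
        return max(len(part & s_b), len(part & s_c)) / len(part)
    return (best_match(s_a1) + best_match(s_a2)) / 2

# Toy case: the original nodes b and c had neighbors {1, 2, 3, 4} and {5, 6, 7}.
s_b, s_c = {1, 2, 3, 4}, {5, 6, 7}
# The split assigns node 4 to the wrong group.
overlap = splitting_overlap({1, 2, 3}, {4, 5, 6, 7}, s_b, s_c)
```

The first group lies entirely within the neighbors of b (fraction 1), and three of the four nodes in the second group belong to c (fraction 0.75), giving an overlap of 0.875.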

Experimental data processing

Raw sequencing data were downloaded from the Sequence Read Archive (project no. PRJNA487001, sample 3), and processed as described previously, without a minimum read count. After then applying either a read count filter or an indirect path filter, the remaining nodes were filtered as described earlier7, first by iteratively removing nodes with fewer than two associated products to remove possible uncorrected sequencing errors, then by selecting the largest connected component. Properties displayed in Fig. 5h,i are derived from the graphs after all filters were applied.
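The two post-filtering steps can be sketched as below. This is an illustration under stated assumptions rather than the pipeline of ref. 7: the number of products associated with a node is taken to be the sum of its edge weights, the graph is held as a dense biadjacency matrix, and the function name `prune_and_select` is hypothetical.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.csgraph import connected_components

def prune_and_select(W, min_products=2):
    """Iteratively drop weakly supported nodes, then keep the largest connected component."""
    W = np.asarray(W)
    rows, cols = np.arange(W.shape[0]), np.arange(W.shape[1])
    while True:
        keep_r = W.sum(axis=1) >= min_products
        keep_c = W.sum(axis=0) >= min_products
        if keep_r.all() and keep_c.all():
            break
        W, rows, cols = W[keep_r][:, keep_c], rows[keep_r], cols[keep_c]
    if W.size == 0:
        return W, rows, cols
    # Full adjacency matrix of the bipartite graph: U nodes first, then V nodes.
    W_sp = sparse.csr_matrix(W)
    full = sparse.bmat([[None, W_sp], [W_sp.T, None]], format="csr")
    _, labels = connected_components(full, directed=False)
    largest = np.bincount(labels).argmax()
    in_u, in_v = labels[:W.shape[0]] == largest, labels[W.shape[0]:] == largest
    return W[in_u][:, in_v], rows[in_u], cols[in_v]

W = [[2, 1, 0, 0],
     [1, 3, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 5]]
W_kept, kept_rows, kept_cols = prune_and_select(W)
```

In this toy matrix the third row and column are supported by a single product and are removed first; of the two remaining components, the one with four nodes is kept. The returned index arrays map the surviving nodes back to the original matrix.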

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this Article.