A Metric on the Space of kth-order reduced Phylogenetic Networks

Phylogenetic networks can be used to describe the evolutionary history of species which experience a certain number of reticulate events, and represent conflicts in phylogenetic trees that may be due to inadequacies of the evolutionary model used in the construction of the trees. Measuring the dissimilarity between two phylogenetic networks is at the heart of our understanding of the evolutionary history of species. This paper proposes a new metric, i.e. kth-distance, for the space of kth-order reduced phylogenetic networks that can be calculated in polynomial time in the size of the compared networks.

Phylogenetic networks play a vital role in describing the evolutionary history of species, and are especially appropriate for datasets whose evolution involves a significant number of reticulate events caused by recombination, hybridization, horizontal gene transfer, gene duplication, gene conversion and loss [1][2][3][4][5][6][7]. Even for species that have evolved according to a tree-like model of evolution, phylogenetic networks can be used to represent conflicts in phylogenetic trees that may be caused by inadequacies of the evolutionary model used. Many algorithms and programs for constructing phylogenetic networks have been proposed. These algorithms are mainly assessed by comparing networks, for example, comparing the constructed network with a simulated network or the actual network. In addition, comparing two phylogenetic networks can help us to understand the evolutionary history of species. Recently, researchers have shown an increased interest in defining metrics for computing the dissimilarity between a pair of phylogenetic networks.
A measure d is called a metric on a space S if it satisfies four properties for any a, b, c ∈ S:
• d(a, b) ≥ 0 (non-negativity);
• d(a, b) = 0 if and only if a = b, which for networks means that a and b are isomorphic (reflexivity);
• d(a, b) = d(b, a) (symmetry);
• d(a, c) ≤ d(a, b) + d(b, c) (triangle inequality).
In general, it is much easier to prove that a defined measure satisfies these properties other than reflexivity. A metric is called trivial if the distance it assigns to two phylogenetic networks is 0 when they are isomorphic and 1 otherwise. A trivial metric obviously satisfies the above properties, but it conveys no further information about the evolutionary histories implied by the two phylogenetic networks. Accordingly, in addition to these four properties, it is desirable that the metric gives some information on the dissimilarity of the evolutionary histories expressed by the phylogenetic networks being compared [8][9][10][11][12][13].
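The metric properties can be spot-checked numerically on a finite sample of objects. The following is a minimal sketch, not part of the paper: the function names `check_metric_axioms` and `trivial_metric`, the predicate `same` (standing in for "is isomorphic to") and the tolerance `tol` are all hypothetical choices made for illustration. Passing the check on a sample does not prove that a measure is a metric; it can only expose violations.

```python
from itertools import product

def check_metric_axioms(items, d, same, tol=1e-12):
    """Return the names of the metric properties violated on `items`.

    `d` is the candidate distance, `same(a, b)` decides whether a and b
    count as equal (for networks: isomorphic). Both are assumptions of
    this sketch, supplied by the caller.
    """
    failed = set()
    for a, b in product(items, repeat=2):
        if d(a, b) < -tol:
            failed.add("non-negativity")
        # d(a, b) = 0 must hold exactly when a and b are "the same".
        if (abs(d(a, b)) <= tol) != same(a, b):
            failed.add("reflexivity")
        if abs(d(a, b) - d(b, a)) > tol:
            failed.add("symmetry")
    for a, b, c in product(items, repeat=3):
        if d(a, c) > d(a, b) + d(b, c) + tol:
            failed.add("triangle inequality")
    return failed

def trivial_metric(a, b):
    """The trivial metric: 0 for equal objects, 1 otherwise."""
    return 0.0 if a == b else 1.0
```

For instance, `check_metric_axioms([1, 2, 3], trivial_metric, lambda a, b: a == b)` returns an empty set, whereas the squared difference `(a - b) ** 2` on `[0, 1, 2]` is reported as violating the triangle inequality.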
Up to now, several measures have been designed, each proven to be a metric on a certain subspace of rooted phylogenetic networks: for example, the μ-metric on the space of tree-sibling phylogenetic networks [14], the tripartition metric on the space of tree-child phylogenetic networks [15][16][17][18], the m-distance on the space of reduced phylogenetic networks [19], and the d_e-distance on the space of partly reduced phylogenetic networks [20]. The largest of these subspaces is that of partly reduced phylogenetic networks, so the d_e-distance is also a metric on the subspaces of tree-child, tree-sibling and reduced phylogenetic networks. This paper introduces a new metric, the kth-distance, on the space of kth-order reduced phylogenetic networks (defined in the following sections); the metric is computable in polynomial time. The space of kth-order reduced phylogenetic networks is a larger subspace of rooted phylogenetic networks than any subspace on which a metric has previously been defined. Unless stated otherwise, the rest of the paper uses "network" to mean a rooted phylogenetic network.

Preliminaries
Let X be a set of taxa. A rooted phylogenetic network N = (V, E) on X is a directed acyclic graph (DAG for short) with one root node, whose leaves are labelled by X through a bijection f.
For a network N = (V, E) and a node u ∈ V:
• if indeg(u) = 0, then u is the root;
• if indeg(u) ≤ 1, then u is a tree node;
• if indeg(u) ≥ 2, then u is a reticulate node;
• if outdeg(u) = 0, then u is a leaf;
• if outdeg(u) ≥ 1, then u is an internal node.
Sometimes we use the notation N = ((V, E), f) to denote the network N, and V_N to denote the leaf set of N. Let u, v ∈ V be two nodes. If (u, v) ∈ E, then we say that v is a child of u, or that u is a parent of v. If there exists a directed path from u to v, then we say that v is a descendant of u, or that u is an ancestor of v.
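The degree-based node types above can be read directly off an edge list. The sketch below assumes that a network is given as a list of directed edges `(parent, child)`; this representation and the function name `classify_nodes` are illustrative choices, not notation from the paper. A node may carry several tags at once, since the conditions are not mutually exclusive (the root, for example, is also a tree node).

```python
from collections import defaultdict

def classify_nodes(edges):
    """Map every node to the set of type names it satisfies."""
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    nodes = set()
    for u, v in edges:          # edge (u, v): u is a parent of v
        outdeg[u] += 1
        indeg[v] += 1
        nodes.update((u, v))

    kinds = {}
    for u in nodes:
        tags = set()
        if indeg[u] == 0:
            tags.add("root")
        if indeg[u] <= 1:
            tags.add("tree")
        if indeg[u] >= 2:
            tags.add("reticulate")
        if outdeg[u] == 0:
            tags.add("leaf")
        if outdeg[u] >= 1:
            tags.add("internal")
        kinds[u] = tags
    return kinds
```

On the small hypothetical network with edges r→a, r→b, a→h, b→h, a→x, b→y, h→z, the node h is tagged as a reticulate internal node, and x as a tree node and a leaf.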
The height of a node u is the length of a longest directed path beginning from u and ending with a leaf. The non-existence of cycles indicates that all nodes of N can be categorized by height: the nodes with height 0 are the leaves; for a node u with height a > 0, each child of u has height m < a and there exists at least one child with height exactly a − 1.
The depth of a node v is the length of a longest directed path beginning from the root and ending with v. In the same way, the non-existence of cycles indicates that all nodes of N can be categorized by depth: the only node with depth 0 is the root; for a node v with depth b > 0, each parent of v has depth m < b and there exists at least one parent with depth exactly b − 1.
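Because the network is acyclic, height and depth can both be computed by a simple memoized recursion over children and parents respectively. The following sketch again assumes an edge-list representation with hashable node identifiers; the function name `heights_and_depths` is hypothetical. It relies on recursion, so it is only suitable for networks whose longest path stays below Python's recursion limit.

```python
from functools import lru_cache

def heights_and_depths(edges):
    """Return (height, depth) dictionaries for a DAG given as (parent, child) edges."""
    children, parents = {}, {}
    for u, v in edges:
        children.setdefault(u, []).append(v)
        children.setdefault(v, [])
        parents.setdefault(v, []).append(u)
        parents.setdefault(u, [])

    @lru_cache(maxsize=None)
    def height(u):
        # Longest directed path from u down to a leaf; leaves have height 0.
        return 0 if not children[u] else 1 + max(height(c) for c in children[u])

    @lru_cache(maxsize=None)
    def depth(v):
        # Longest directed path from the root down to v; the root has depth 0.
        return 0 if not parents[v] else 1 + max(depth(p) for p in parents[v])

    return {u: height(u) for u in children}, {v: depth(v) for v in parents}
```

Each node is evaluated once thanks to the cache, so the running time is linear in the number of nodes and edges.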

Definition 1.
For two networks N_1 = ((V_1, E_1), f_1) and N_2 = ((V_2, E_2), f_2), they are isomorphic if and only if there exists a bijection H from V_1 to V_2 such that:
• (u, v) ∈ E_1 if and only if (H(u), H(v)) ∈ E_2;
• H preserves the labels of the leaves.
Although the subspace on which the d_e-distance is a metric is the largest among all those defined so far, a large number of networks cannot be distinguished by the d_e-distance. For example, the two networks in Fig. 1 (from the paper [20]) are not isomorphic, yet the d_e-distance between them is 0. Even for two non-isomorphic networks whose d_e-distance is not 0, the distance usually takes its maximal value 1. For example, the networks in Fig. 2 bear a certain resemblance to each other, so it is desirable that the distance between them be less than 1. However, their d_e-distance is the maximal value 1. On the other hand, for any two networks N_1 on X_1 and N_2 on X_2, the d_e-distance

Methods
Let N = ((V, E), f) be a network. We first give several definitions for nodes within the same network.
Example 1. Consider the network N_1 in Fig. 1. Each node of N_1 is first-order equivalent with itself, and C ≡_1 E, D ≡_1 F, H ≡_1 J.

Definition 3.
Given an even number k ≥ 2. Two nodes u, v ∈ V (not necessarily different) are called kth-order equivalent, denoted by u ≡_k v, if u ≡_{k−1} v, and: • u, v are the root, or

Example 2.
Consider the network N_1 in Fig. 1 again. Each node of N_1 is second-order equivalent with itself, and H ≡_2 J. For k ≥ 3, each node of N_1 is only kth-order equivalent with itself.

Lemma 1.
Here k is an odd number. Let u_1, u_2, …, u_s be nodes in a network such that each u_i has l children and each child of u_i is only kth-order equivalent with itself (1 ≤ i ≤ s). Then u_1 ≡_k u_2 ≡_k … ≡_k u_s if and only if u_1, u_2, …, u_s have the same children (refer to Fig. 4).

Lemma 2.
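Here k is an even number. Let v_1, v_2, …, v_s be nodes in a network such that each v_i has l parents and each parent of v_i is only kth-order equivalent with itself (1 ≤ i ≤ s). Then v_1 ≡_k v_2 ≡_k … ≡_k v_s if and only if v_1, v_2, …, v_s have the same parents (refer to Fig. 5).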

Lemma 3. For all leaves, the root and the nodes with height 1 in a network, each of them is kth-order equivalent with itself (for any k).
The proofs of Lemmas 1, 2 and 3 are omitted here. It follows from these definitions that each kth-order equivalence is an equivalence relation, i.e. it is reflexive, symmetric and transitive. It can easily be proved that all first-order equivalent nodes have the same height, and that all kth-order equivalent nodes (k ≥ 2) have the same height and depth (refer to [20]).
If a node u is kth-order equivalent with some node other than itself, we say that u has non-trivial kth-order equivalent nodes. For a network, after deleting the non-trivial kth-order equivalent nodes of each node, as well as the nodes with indegree 1 and outdegree 1, the resulting network is called a kth-order reduced phylogenetic network. All the kth-order reduced phylogenetic networks form the space of kth-order reduced phylogenetic networks. Thus a network N is in this space if and only if each node of N is only kth-order equivalent with itself.
The space of first-order reduced phylogenetic networks is the space of reduced phylogenetic networks defined in the paper [19]. The space of second-order reduced phylogenetic networks is the space of partly reduced phylogenetic networks defined in the paper [20]. Figure 6 shows the relationship between these subspaces. The space of kth-order reduced phylogenetic networks is not equal to the space of rooted phylogenetic networks. For example, in the network N in Fig. 7, for any k, each node of N is kth-order equivalent with itself, and A ≡_k B. So N is not a kth-order reduced phylogenetic network, i.e. it is not in the space of kth-order reduced phylogenetic networks.
In order to compute the dissimilarity of two networks, we extend the above concepts, defined within one network, to pairs of networks in the following sections. Let N_1 = ((V_1, E_1), f_1) and N_2 = ((V_2, E_2), f_2) be two networks.

Definition 6.
Given an even number k ≥ 2. Two nodes u ∈ V_1, v ∈ V_2 are called kth-order equivalent, denoted by u ≡_k v, if u ≡_{k−1} v, and:
• u, v are the roots, or
• node u has l (≥ 1) parents u_1, u_2, …, u_l, node v has l parents v_1, v_2, …, v_l, and u_i ≡_k v_i for 1 ≤ i ≤ l.

Definition 7.
Given an odd number k ≥ 2. Two nodes u ∈ V_1, v ∈ V_2 are called kth-order equivalent, denoted by u ≡_k v, if u ≡_{k−1} v, and:
Let u, u_0 be two nodes from two networks or from the same network. From these definitions, it follows that if there exists a positive integer k_1 such that u ≢_{k_1} u_0, then u ≢_k u_0 for any k > k_1.
Let N_1 = (V_1, E_1) and N_2 = (V_2, E_2) be two networks. We use the following process to compute the set of kth-order unique nodes of N_1, denoted by L_k(N_1). First, set L_k(N_1) = ∅. Then, for each node u ∈ V_1, if there is no node u_0 ∈ L_k(N_1) such that u ≡_k u_0, add u to L_k(N_1). Similarly, we can compute L_k(N_2). For each node u ∈ L_k(N_1), e^k_{N_1}(u) denotes the number of nodes which are kth-order equivalent with u, i.e. e^k_{N_1}(u) = |{v ∈ V_1 : v ≡_k u}|, and e^k_{N_2}(u) is defined in the same way for each node u ∈ L_k(N_2). For the sake of simplicity, we drop the subscript of e. Here e^k(∅) = 0.
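The construction of the kth-order unique nodes and of the counts e can be sketched as follows. The sketch takes the equivalence test as a caller-supplied predicate `equivalent(u, v)`, so it does not depend on how kth-order equivalence is decided. The function names `unique_nodes` and `count_equivalent`, and the use of `None` to represent ∅, are assumptions made for this illustration.

```python
def unique_nodes(nodes, equivalent):
    """Greedy construction of the unique nodes, as described in the text.

    Start from an empty list and add a node only if no node already in
    the list is equivalent to it.
    """
    reps = []
    for u in nodes:
        if not any(equivalent(u, u0) for u0 in reps):
            reps.append(u)
    return reps

def count_equivalent(u, nodes, equivalent):
    """Number of nodes in `nodes` that are equivalent to u (0 for the empty node)."""
    if u is None:          # assumption: None stands for the empty node
        return 0
    return sum(1 for v in nodes if equivalent(u, v))
```

With the toy relation "same parity" on the nodes 1, 2, 3, 4, 5, the unique nodes are 1 and 2, with counts 3 and 2 respectively. Because the relation is an equivalence, the result contains exactly one representative per class regardless of the iteration order, although which representative is chosen does depend on that order.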

Lemma 4. Given two networks N
Proof. Refer to the proof of Theorem 15 in the paper [20].
is more than d. From Definition 8, it follows that the 1st-distance is the m-distance defined on the space of reduced phylogenetic networks, and the 2nd-distance is the d_e-distance defined on the space of partly reduced phylogenetic networks. (N_1, N_2) = 0. Then |V_1| = |V_2|, and there exists a node v_0 ∈ L_i(V_2) for each node v ∈ L_i(V_1
Proof. If N_1 and N_2 are isomorphic, then obviously d_k(N_1, N_2) = 0. The converse is proved as follows.
Lemma 5 tells us that |V_1| = |V_2|. From the property of kth-order reduced phylogenetic networks, it follows that each node u in V_1 is only kth-order equivalent with itself and u ∈ L_k(V_1). Similarly, each node v in V_2 is only kth-order equivalent with itself and v ∈ L_k(V_2). Moreover, for each node u ∈ V_1, there exists exactly one node v ∈ V_2 such that u ≡_k v. So we define a mapping H from V_1 to V_2 by H(u) = u′ for each node u ∈ V_1, where u′ ∈ V_2 and u′ ≡_k u.
First we prove that the mapping H is a bijection. For any two different nodes u_1, u_2 ∈ V_1, there exist two nodes u′_1, u′_2 ∈ V_2 such that H(u_1) = u′_1 and H(u_2) = u′_2. Here u′_1 and u′_2 are not the same node. If they were, then u_1 ≡_k u_2, which contradicts the fact that each node u ∈ V_1 is only kth-order equivalent with itself. So H is injective. Since |V_1| = |V_2|, H is also surjective.
Next we prove that H preserves the edges. Let (u, v) ∈ E_1, and let u_0 = H(u) and v_0 = H(v). If k is an odd number, then the children of u are kth-order equivalent with the children of u_0 respectively. Thus, v is kth-order equivalent with a child v′ of u_0, i.e. v′ ≡_k v ≡_k v_0. Since every node is only kth-order equivalent with itself, v′ and v_0 are the same node, i.e. v_0 is a child of u_0. Therefore, (u_0, v_0) ∈ E_2. Similarly, we can reach the same conclusion when k is an even number.
The mapping H also preserves the labels of the leaves from the definition of kth-order equivalence. In conclusion, N 1 and N 2 are isomorphic. From Lemmas 6, 7 and 8, we have the following result:

Theorem 9
The kth-distance defined by the formula 1 is a metric on the space of kth-order reduced phylogenetic networks.
Let k = 3 and let n_j be the number of nodes of network N_j (j = 1, 2). Consider the two networks in Fig. 1. For i = 1 and 2 , . So d(N_1, N_2) = 1/3. Consider the two networks in Fig. 2. The nodes R, B, E, F, K in V_1 have no first-order equivalent nodes in V_2, while the nodes R, B, F in V_2 have no first-order equivalent nodes in V_1. Every other node has exactly one first-order equivalent node. So  (N_1, N_2) = 0 for all k. Then there exists a positive integer m such that, for any m_0 ≥ m, each node u in V_1 has an m_0th-order equivalent node u′ in V_2.
Proof. Assume that the above conclusion does not hold, i.e. for any positive integer m, there exist k_0 ≥ m and a node u ∈ V_1 such that u′ ≢_{k_0} u for any node u′ ∈ V_2. So when m = 1, there exist k_1 and u_1 ∈ V_1 such that u_1 ≢_{k_1} u′. This conclusion contradicts d_k(N_1, N_2) = 0 for all k. ◽

Computational Aspects
For odd k (respectively even k), the kth-order equivalent nodes can be computed by a bottom-up (respectively top-down) approach, regardless of whether the nodes are in the same network or in two different networks. Let N_1 = ((V_1, E_1), f_1) and N_2 = ((V_2, E_2), f_2) be two networks. Algorithm 8 shows the pseudocode that decides whether two nodes are kth-order equivalent, where E(k) abbreviates the set of kth-order equivalent nodes. This process costs at most O(n^3) time, where n = max(|V_1|, |V_2|). Therefore, it takes at most O(n^5) time in total to find all ith-order (1 ≤ i ≤ k) equivalent nodes for each node of the two networks. Computing formula 1 costs O(n) time. In conclusion, computing the kth-distance between two networks takes O(n^5) time, where n is the maximum of |V_1| and |V_2|.
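The alternating bottom-up and top-down passes can be illustrated with a sketch that assigns every node a class identifier per order, so that two nodes are ith-order equivalent exactly when their identifiers agree. This is not the paper's Algorithm 8 and makes no claim about matching its running time. It rests on assumptions that the surviving text does not fully state: the first-order rule and the odd-order rule are taken to mirror the even-order rule with children in place of parents (two leaves match when their labels agree), and matching "child by child" is implemented as equality of the sorted class identifiers. The names `kth_order_classes`, `is_kth_order_reduced` and `leaf_label` are hypothetical. To compare two networks, their edges can be passed together, provided node identifiers are distinct across the networks.

```python
def kth_order_classes(edges, leaf_label, k):
    """Return {node: class id}; equal ids mean kth-order equivalent (under the stated assumptions)."""
    children, parents = {}, {}
    for u, v in edges:
        children.setdefault(u, []).append(v)
        children.setdefault(v, [])
        parents.setdefault(v, []).append(u)
        parents.setdefault(u, [])

    # Topological order (parents before children) via Kahn's algorithm.
    indeg = {u: len(parents[u]) for u in parents}
    order = [u for u in indeg if indeg[u] == 0]
    for u in order:                      # the list grows while we iterate
        for c in children[u]:
            indeg[c] -= 1
            if indeg[c] == 0:
                order.append(c)

    cls = {u: 0 for u in order}          # "order 0": all nodes in one class
    for i in range(1, k + 1):
        bottom_up = (i % 2 == 1)         # odd orders look at children
        nbrs = children if bottom_up else parents
        seq = reversed(order) if bottom_up else order
        ids, new = {}, {}
        for u in seq:
            if nbrs[u]:
                sig = (cls[u], "inner", tuple(sorted(new[w] for w in nbrs[u])))
            elif bottom_up:
                sig = (cls[u], "leaf", leaf_label[u])
            else:
                sig = (cls[u], "root", None)
            new[u] = ids.setdefault(sig, len(ids))
        cls = new
    return cls

def is_kth_order_reduced(edges, leaf_label, k):
    """True if every node is kth-order equivalent only with itself."""
    cls = kth_order_classes(edges, leaf_label, k)
    return len(set(cls.values())) == len(cls)
```

As a usage example, in a hypothetical network with edges r→A, r→B, A→x, A→y, B→x, B→y, the nodes A and B share both their parent and their children, so they receive the same class for every k and the network is not kth-order reduced, in the spirit of the Fig. 7 example.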

Results
We compared the kth-distance with the m-distance on the space of reduced phylogenetic networks [19] and the d_e-distance on the space of partly reduced phylogenetic networks [20], using 100 networks constructed by the Lnetwork method [3]. Each distance method thus yields a distance matrix with approximately 5000 values (100 × 99 / 2 = 4950 distinct pairs of networks).