Combinatorial characterization of a certain class of words and a conjectured connection with general subclasses of phylogenetic tree-child networks

The combinatorial study of phylogenetic networks has attracted much attention in recent times. In particular, one class of them, the so-called tree-child networks, is becoming the most prominent. However, their combinatorial properties are largely unknown. In this paper we address the problem of counting them exactly. We conjecture a relationship with the cardinality of a certain class of words. By solving the counting problem for the words, and on the basis of the conjecture, several simple recurrence formulas for general cases arise. Moreover, a precise asymptotic analysis is provided. Our results coincide with all current formulas in the literature for particular subclasses of tree-child networks, as well as with numerical results obtained for small networks. We expect that the study of the relationship between the newly defined words and the networks will lead to further combinatorial characterizations of this class of phylogenetic networks.

A Proofs of the statements about means and dispersions
Proposition 1. Words of the type $\alpha_{\sigma(1)}\alpha_{\sigma(2)}\cdots\alpha_{\sigma(n)}\,a\,b\,c\cdots\alpha_n \in B_n$, where $\sigma$ is any permutation of $\{1, 2, \dots, n\}$, have:
a) The highest mean distance, specifically $\bar{x} = n-1$, among all words belonging to $B_n$.
b) A maximum variance equal to $(n^2-1)/3$, attained by the reversing order permutation $\sigma(i) = n+1-i$.
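The statement of Proposition 1 can be sanity-checked by brute force for small $n$. The sketch below is only an illustration, not part of the proof: it assumes that the distance between the two copies of a letter is the number of letters strictly between them (consistent with $\bar{x} = n-1$ for the identity permutation), and it scans only the family of words named in the proposition, not all of $B_n$. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def distances(word):
    """Number of letters strictly between the two occurrences of each letter."""
    first, gaps = {}, []
    for pos, letter in enumerate(word):
        if letter in first:
            gaps.append(pos - first[letter] - 1)
        else:
            first[letter] = pos
    return gaps


def mean_and_variance(values):
    """Exact population mean and variance, returned as Fractions."""
    mean = Fraction(sum(values), len(values))
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def check_proposition_1(n):
    """Scan all words sigma(1..n) followed by 1..n.

    Asserts part a) for every permutation and returns the largest
    variance found together with the permutation attaining it.
    """
    best_var, best_sigma = Fraction(-1), None
    for sigma in permutations(range(1, n + 1)):
        word = list(sigma) + list(range(1, n + 1))
        mean, var = mean_and_variance(distances(word))
        assert mean == n - 1  # same mean for every permutation
        if var > best_var:
            best_var, best_sigma = var, sigma
    return best_var, best_sigma


if __name__ == "__main__":
    for n in range(2, 7):
        print(n, check_proposition_1(n))
```

For each $n$ tried, the largest variance returned can be compared with $(n^2-1)/3$ and the maximizing permutation with $\sigma(i) = n+1-i$.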
Proof. a) To maximize the distance between characters, the second occurrence of each character must come after all other characters have appeared. Suppose this were not the case, i.e. there is a letter $\alpha$ whose second appearance happens before all other letters have appeared once. Moving the second appearance of $\alpha$ to the right, up to the position immediately to the left of the second appearance of whichever other letter repeats first, increases the distance between the two copies of $\alpha$ by $m$ units ($m \ge 1$ by hypothesis), and the distance of each letter jumped over increases by one unit as well.
The "prefix condition" stated in Definition 2 forces the last $n$ characters to appear in lexicographic order. Regarding the arrangement of the first $n$ letters, note that for the identity permutation all distances are equal to $n-1$. To see that any permutation leads to the same mean distance, observe that moving a single letter $m$ positions to the right (left) decreases (increases) the distance to its counterpart by $m$ units, while each of the $m$ letters jumped over increases (decreases) its distance by one unit.
b) Taking into account that the distance between characters is $x(i) = n-1+\sigma(i)-i$, and using a), we have:
\[ \mathrm{Var} = \frac{1}{n}\sum_{i=1}^{n}\bigl(x(i)-\bar{x}\bigr)^2 = \frac{1}{n}\sum_{i=1}^{n}\bigl(\sigma(i)-i\bigr)^2 = \frac{2}{n}\sum_{i=1}^{n} i^2 - \frac{2}{n}\sum_{i=1}^{n} i\,\sigma(i). \]
As stated by the "rearrangement inequality", the rightmost summation is minimized by the reversing order permutation. Then
\[ \mathrm{Var}_{\max} = \frac{1}{n}\sum_{i=1}^{n}(n+1-2i)^2 = \frac{1}{3}(n^2-1). \]
Proposition 2. Words of the type $zy\cdots rq\; aabb\cdots pp\; qr\cdots yz$ have a maximal Standard Deviation among all words belonging to $B_n$. This maximum value tends toward SD ∝ 5−
Proof. The first part of the proof reduces the "search space" for words with largest variance. We prove by induction that words with largest variance have the structure shown in the statement, with an unspecified length $s$ of the prefix string $zy\cdots rq$ that will be determined at the end of the proof. Accordingly, the possibility $s = 0$, which never corresponds to a word with largest variance, is always kept in the search space.
For $n = 2$ the statement holds because the word with largest variance is $baab$, which belongs to the search space $S = \{aabb, baab\}$. By the induction hypothesis, we assume that for every $m < n$, words with largest variance have the stated structure. We now prove that this assumption implies that any word of length $2n$ with largest variance has the same structure.
Given any word $w \in B_n$, consider the initial string of $m$ letters placed just before the second appearance of $a$. These $m$ letters, which form the set we call $X$, must be distinct because the first letter (starting from the left) that can be repeated is $a$. Notice that the $s$ rightmost letters of the word have to be the last $s$ letters of the alphabet and they have to appear in lexicographic order. In addition, the second appearances (to the right of $a$) of the letters in $X$ must also appear in lexicographic order, possibly with other letters interleaved. Then, by the rearrangement inequality, we know that arranging the $s$ leftmost letters in reverse lexicographic order leads to the largest variance.
Let us now perform the appropriate permutation on the initial $s$-substring and look for the lowest $k_1$ ($\le s$) such that the letter placed at position $k_1$, say $w_{k_1} \equiv \alpha$, is different from $a_{n+1-k_1} \equiv \beta$. Now interchange the letter $w_{k_1}$ with the first appearance of the missing letter $\beta$, located at position $k_3$, i.e. $w_{k_3} = \beta$. Next we show that this operation always raises the variance. Since the first appearances of two letters have been interchanged, the mean distance is not modified, and thus we only need to analyze the change in the sum of squared distances. If the second appearances of letters $\alpha$ and $\beta$ occur at positions $k_2$ and $k_4$ respectively, the change in the variance is
\[ \Delta\mathrm{Var} = \frac{1}{n}\Bigl[(k_4-k_1-1)^2 + (k_2-k_3-1)^2 - (k_2-k_1-1)^2 - (k_4-k_3-1)^2\Bigr] = \frac{2}{n}\,(k_3-k_1)(k_4-k_2). \]
Therefore the change is always positive because, evidently, $k_3 > k_1$ and the "prefix condition" forces $k_4 > k_2$.
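The sign argument of the swap step can be illustrated with a short exhaustive check. This is a sketch under two assumptions that are not quoted from the text: distances count the letters strictly between two positions, and the positions are ordered as $k_1 < k_3 < k_2 < k_4$, so that every distance stays non-negative after the swap. Under these assumptions the gain in the sum of squared distances works out to $2(k_3-k_1)(k_4-k_2)$. The function names are hypothetical.

```python
from itertools import combinations


def gap(p, q):
    """Number of letters strictly between positions p < q."""
    return q - p - 1


def swap_gain(k1, k2, k3, k4):
    """Change in the sum of squared distances when the first appearances
    at k1 (letter alpha) and k3 (letter beta) are interchanged.

    Before: alpha sits at (k1, k2) and beta at (k3, k4).
    After:  beta sits at (k1, k4) and alpha at (k3, k2).
    """
    before = gap(k1, k2) ** 2 + gap(k3, k4) ** 2
    after = gap(k1, k4) ** 2 + gap(k3, k2) ** 2
    return after - before


def identity_holds(limit):
    """Check gain == 2 (k3 - k1)(k4 - k2) for all k1 < k3 < k2 < k4 <= limit."""
    return all(
        swap_gain(k1, k2, k3, k4) == 2 * (k3 - k1) * (k4 - k2)
        for k1, k3, k2, k4 in combinations(range(1, limit + 1), 4)
    )


if __name__ == "__main__":
    print(swap_gain(1, 5, 3, 8))
    print(identity_holds(12))
```

Since both factors are positive in this ordering, the gain is positive, which is the conclusion the proof draws.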
The following step is to perform similar swaps with the rest of the letters of the (updated) set $X$. The process ends with a word having an $s$-prefix formed by the last $s$ letters of the alphabet in reverse lexicographic order, and an $s$-suffix formed by the same $s$ letters in lexicographic order. No other operation can be performed on the prefix or suffix to increase the variance. Note that the central string now belongs to $B_{n-s}$, so by the induction hypothesis we conclude that the overall construction has the correct structure.
The length of the initial string $zy\cdots rq$ depends on $n$. To determine it, we leave it as a parameter $s$: next we calculate the variance in terms of this parameter and seek the values of $s$ that maximize it.
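The dependence of the variance on the prefix length can also be explored numerically. The following sketch builds the words $zy\cdots rq\; aabb\cdots pp\; qr\cdots yz$ explicitly, with letters encoded as the integers $1,\dots,n$, and scans every prefix length. It assumes that distances count the letters strictly between the two copies of a letter; the helper names are hypothetical and the scan is a numerical illustration, not a substitute for the analytic argument.

```python
from fractions import Fraction


def family_word(n, s):
    """Word  z y ... q | a a b b ... p p | q ... y z  over letters 1..n."""
    tail = list(range(n - s + 1, n + 1))  # the last s letters, ascending
    middle = [x for x in range(1, n - s + 1) for _ in range(2)]
    return tail[::-1] + middle + tail


def gap_variance(word):
    """Exact variance of the gaps between the two copies of each letter."""
    first, gaps = {}, []
    for pos, letter in enumerate(word):
        if letter in first:
            gaps.append(pos - first[letter] - 1)  # letters strictly in between
        else:
            first[letter] = pos
    mean = Fraction(sum(gaps), len(gaps))
    return sum((g - mean) ** 2 for g in gaps) / len(gaps)


def best_prefix_length(n):
    """Smallest s in 0..n whose family word has the largest variance."""
    return max(range(n + 1), key=lambda s: gap_variance(family_word(n, s)))


if __name__ == "__main__":
    for n in (2, 10, 50, 200):
        s = best_prefix_length(n)
        print(n, s, float(gap_variance(family_word(n, s))) ** 0.5)
```

For $n = 2$ the scan selects $s = 1$, i.e. the word $baab$, in agreement with the base case of the induction.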
In terms of the length $s$ of the prefix, the mean distance of such a word with $2n$ letters is given by Eq. (A1), and the variance by Eq. (A2). Considering $s$ to be a continuous parameter, the real function Var has a unique maximum in the interval $0 < s < n$, which can be determined by imposing the extremum condition. By inspecting the solutions of the resulting cubic polynomial, and taking the limit $n \to \infty$, we find how the length of the prefix scales with $n$. Substituting this value into Eqs. (A1) and (A2) yields the mean and the variance at the maximum, and finally the Standard Deviation is determined with a standard asymptotic expansion.
Proposition 3. Words of the type $\alpha_{\sigma(1)}\alpha_{\sigma(2)}\cdots\alpha_{\sigma(n)}\,a\,b\,c\cdots\alpha_n\,a\,b\,c\cdots\alpha_n \in A_n$, where $\sigma$ is any permutation of $\{1, 2, \dots, n\}$, have:
a) The highest mean distance, specifically $\bar{x} = n-1$, among all words belonging to $A_n$.
b) A maximum variance equal to $(n^2-1)/6$, attained by the reversing order permutation $\sigma(i) = n+1-i$.
Proof. a) To maximize the distance between characters, the second occurrence of a character, starting from the left, needs to come after all other characters have appeared, as explained in the proof of Proposition 1. By the same argument, the third occurrence of a character needs to come after all other characters have appeared twice. Then, to maximize the mean distance, the letters must be grouped in three strings of $n$ distinct letters.
To fulfill the "prefix condition", the second and third substrings (starting from the left) need to appear in lexicographic order. Regarding the arrangement of the first $n$ letters, note that for the identity permutation all distances are equal to $n-1$. Clearly the same reasoning as in Proposition 1 applies, and any permutation leads to the same mean distance.
b) Taking into account that the distance between the first and second appearances is $x(i) = n-1+\sigma(i)-i$, and that every distance between the second and third appearance of a letter is equal to $n-1$, we have:
\[ \mathrm{Var} = \frac{1}{2n}\sum_{i=1}^{n}\bigl(\sigma(i)-i\bigr)^2 = \frac{1}{n}\sum_{i=1}^{n} i^2 - \frac{1}{n}\sum_{i=1}^{n} i\,\sigma(i). \]
As stated by the "rearrangement inequality", the rightmost summation is minimized by the reversing order permutation. Then
\[ \mathrm{Var}_{\max} = \frac{1}{2n}\sum_{i=1}^{n}(n+1-2i)^2 = \frac{1}{6}(n^2-1). \]
Unfortunately we have not been able to formally prove, as in Proposition 2, that words in $A_n$ with largest variance are of the form $zy\cdots rq\; aaabbb\cdots ppp\; qqrr\cdots zz$. However, it is intuitively clear that this must be so. The largest variances are achieved by words that combine the largest possible distances between letters with the smallest ones, which are zero. Clearly, the largest distances are obtained by the following construction: $w(a, b, \dots, z) = z\, w(a, b, \dots, y)\, zz$.
In this fashion, we add both the maximum distance and the minimum one. Notice that the repetition of a letter cannot be placed at the beginning of the word, because the prefix condition would prevent the formation of any gap. Once the largest distances are secured, the possibility of having more letters at zero distance remains open; to keep the large distances, these can only be groups of three repeated letters placed in the central part of the word. Again, due to the prefix condition, the repetitions need to appear in lexicographic order.
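The construction just described can be explored numerically. The sketch below builds the words $zy\cdots rq\; aaabbb\cdots ppp\; qqrr\cdots zz$ with letters encoded as $1,\dots,n$, measures the $2n$ gaps between consecutive copies of each letter (assumed here to be the number of letters strictly in between), and scans all prefix lengths. It only illustrates the family itself; it says nothing about whether the family is optimal within $A_n$. All names are hypothetical.

```python
from fractions import Fraction
from math import sqrt


def family_word_a(n, s):
    """Word  z y ... q | aaa bbb ... ppp | qq rr ... zz  over letters 1..n."""
    tail = list(range(n - s + 1, n + 1))  # the last s letters, ascending
    middle = [x for x in range(1, n - s + 1) for _ in range(3)]
    suffix = [x for x in tail for _ in range(2)]
    return tail[::-1] + middle + suffix


def consecutive_gaps(word):
    """Gaps between consecutive occurrences of every letter (2 per letter)."""
    last, gaps = {}, []
    for pos, letter in enumerate(word):
        if letter in last:
            gaps.append(pos - last[letter] - 1)
        last[letter] = pos
    return gaps


def gap_variance(word):
    """Exact variance of all consecutive gaps of the word."""
    gaps = consecutive_gaps(word)
    mean = Fraction(sum(gaps), len(gaps))
    return sum((g - mean) ** 2 for g in gaps) / len(gaps)


def best_prefix_length(n):
    """Smallest s in 0..n whose family word has the largest variance."""
    return max(range(n + 1), key=lambda s: gap_variance(family_word_a(n, s)))


if __name__ == "__main__":
    for n in (10, 50, 200):
        print(n, best_prefix_length(n), (2 - sqrt(2)) * n)
```

The printed pairs allow a direct comparison of the scanned optimum with the leading term $(2-\sqrt{2})\,n$ of Proposition 4.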
Proposition 4. Assume that words of the type $zy\cdots rq\; aaabbb\cdots ppp\; qqrr\cdots zz$ have a maximal Standard Deviation among all words in $A_n$. Then the length $s$ of the prefix $zy\cdots rq$ that maximizes the variance tends to $(2-\sqrt{2})\left(n-\frac{1}{4}\right) + O\!\left(\frac{1}{n}\right)$ as $n \to \infty$. Implying that the maximal Standard Deviation
Proof. Let us calculate the variance in terms of the length $s$ of the initial string $zy\cdots rq$, starting from the mean distance $\bar{x}$ of such a word. Considering $s$ to be a continuous parameter, the real function Var has a unique maximum in the interval $0 < s < n$, which can be determined by imposing the extremum condition. By inspecting the solutions of the resulting cubic polynomial, and taking the limit $n \to \infty$, we find how the length of the prefix scales with $n$. Substituting this value into Eqs. (A1) and (A2) yields the mean and the variance at the maximum, and finally the Standard Deviation is determined with a standard asymptotic expansion.

Next, all words belonging to the class $C_{3,1}$ are displayed in the same order as Algorithm A (see section F) visits them. Applying Algorithm E to any word returns the indicated ordinal number.
 1 : a a b b c c a    20 : a a b a b c c    39 : b a c a b a c
 2 : a b a b c c a    21 : a b a a b c c    40 : a a c a b b c
 3 : b a a b c c a    22 : b a a a b c c    41 : a a c b a b c
 4 : a a b c b c a    23 : a a b b c a c    42 : a b c a a b c
 5 : a b a c b c a    24 : a b a b c a c    43 : b a c a a b c
 6 : b a a c b c a    25 : b a a b c a c    44 : a c a b b a c
 7 : a a c b b c a    26 : a a a b c b c    45 : a c b a b a c
 8 : a b c a b c

The proof proceeds in two stages: the first part proves Eq. (1), with coefficients $a_i$ independent of $(n, k)$. The second part, which essentially involves the property $c_{n,n} = c_{n,n-1}$, deals with the explicit form of $\{a_i\}$.
Let us suppose, by the induction hypothesis, that the expressions corresponding to $c_{n-1,k}$ and $c_{n,k-1}$ hold true. Inserting them into the recursive relation (3) shows that the aforementioned $c_{n,k}$ comply with relation (7). Regarding the coefficients $a_i$, relation (8) is obtained from the property $c_{n,n} = c_{n,n-1}$: writing out both sides and equating the two resulting expressions yields $a_n$.

D Solution of the first order difference equation
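To determine the solution of the recurrence $y_{m+1} = a_m y_m + b_m$, Eq. (D1), with $a_m \neq 0$, divide each term by the common product of the coefficients $a_j$ accumulated up to index $m$. In terms of the new variable defined in (D2), Eq. (D1) becomes a recurrence with unit coefficient. A telescoping sum then gives the new variable explicitly, and it only remains to isolate $y_{m+1}$ in the definition (D2).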
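The procedure can be sketched in a few lines. Since the displayed formulas are not reproduced here, the index conventions are assumptions: the recurrence is taken to start at $m = 0$ with a given $y_0$, the running product is $P_m = a_0 a_1 \cdots a_m$, and the new variable is $z_{m+1} = y_{m+1}/P_m$ with $z_0 = y_0$, so that $z_{m+1} = z_m + b_m/P_m$. The function names are hypothetical.

```python
from fractions import Fraction


def solve_by_iteration(a, b, y0):
    """Apply y_{m+1} = a_m * y_m + b_m directly; returns [y_0, y_1, ...]."""
    ys = [Fraction(y0)]
    for am, bm in zip(a, b):
        ys.append(am * ys[-1] + bm)
    return ys


def solve_by_telescoping(a, b, y0):
    """Closed form y_{m+1} = P_m * (y_0 + sum_{i<=m} b_i / P_i).

    P_i = a_0 * ... * a_i is the running product; every a_i must be nonzero.
    """
    ys = [Fraction(y0)]
    product = Fraction(1)
    partial_sum = Fraction(y0)  # z_0 = y_0
    for am, bm in zip(a, b):
        product *= am  # P_m
        partial_sum += Fraction(bm) / product  # z_{m+1} = z_m + b_m / P_m
        ys.append(product * partial_sum)  # y_{m+1} = P_m * z_{m+1}
    return ys


if __name__ == "__main__":
    a, b = [2, 3, 5], [1, 1, 1]
    print(solve_by_iteration(a, b, 1))
    print(solve_by_telescoping(a, b, 1))
```

Both functions return the same sequence, which is a quick way to confirm that dividing by the running product and telescoping reproduces the direct iteration.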

E Proof of the asymptotic formula, Proposition 7
To derive expression (9) it is enough to consider the asymptotic expansion of the double factorial function appearing in the numerator of Eq. (7) for large $n$. The key point is that the part of the argument involving the summation index $i$ generates a factor $n^{-i/2}$. Thus, for large $n$ only the first terms in the series $a_i$ contribute to the sum. The asymptotic behavior of the double factorial can be found in the literature (see, for instance, http://functions.wolfram.com/06.02.06.0008.01), where it is given for integer $m$. We now make the substitution $m \to 2n + 2k - i - 1$. In order to obtain a power series for large $n$ it suffices to expand $\ln(1+x)$ and $e^{-x}$ as Taylor series. With the definitions $\alpha \equiv k - i/2$ and $\beta \equiv (2\alpha-1)/(8n)$ we now obtain:
\[ (2n+2\alpha-1)^{n+\alpha} = (2n)^{n+\alpha}\, e^{\alpha-1/2}\left[1 + (2\alpha+1)\beta + \frac{1}{6}\bigl(12\alpha^2-4\alpha-13\bigr)\beta^2 + \cdots\right]. \]
The sum of powers, $S \equiv 1 + \frac{1}{6m} + \frac{1}{72m^2} - \dots$, is expanded in the same way in terms of $\gamma \equiv \frac{1}{2n}$. Collecting all contributions, we get the asymptotic series (E4). To obtain the asymptotic expression (11) we need to add up the first terms $i = 0, \dots, 4$ of the sum (9), apply the asymptotic series (E4) just obtained, and group the terms in powers of $n$. In order to get a simple formula we also applied an additional equality.

F Generator of general classes of words $C_{n,k}$
Next we provide an algorithm, based on the recurrence schema described in subsection 3.1 of the main text, that sequentially generates, with low memory requirements, all words in $C_{n,k}$.
Algorithm A (General Words Generator): Given numbers $n$ and $k$, the algorithm visits all words $w = c_1 c_2 \dots c_{2n+k}$ in $C_{n,k}$, starting with the word $1122\dots nn123\dots k$. Variable $j$ names the letter (number), located at position $p$, that we wish to move. An auxiliary vector $a_1 a_2 \cdots a_n$ is used to track the number of available interchanges.
Set $a_i \leftarrow 0$ for $1 \le i < n$ and set $a_n \leftarrow \min(k, n-1)$.
A3. [Try to move left.] Find the least $p$ such that $c_p = j$. Find the greatest $q < p$ such that $c_q < j$. If $q$ is found, set $c_p \leftarrow c_q$ and $c_q \leftarrow j$. Also set $a_i \leftarrow 0$ for $1 \le i < j-1$ and go to A2.
A4. [Try to move right.] Find the least $q > p$ such that $c_q = j$. Find the least $s > q$ such that $c_s < j$ ($s < n+q$). If found and $c_s \le a_j$, then set $c_p \leftarrow c_s$, $c_s \leftarrow j$ and update the auxiliary vector: set $a_{j-1} \leftarrow \min(j-1, a_{j-1}+1)$ and $a_i \leftarrow 0$ for $1 \le i < j-1$, and go to A6.