Identifying likely transmissions in Mycobacterium bovis infected populations of cattle and badgers using the Kolmogorov Forward Equations

Established methods for whole-genome-sequencing (WGS) technology allow for the detection of single-nucleotide polymorphisms (SNPs) in the pathogen genomes sourced from host samples. The information obtained can be used to track the pathogen’s evolution in time and potentially identify ‘who-infected-whom’ with unprecedented accuracy. Successful methods include ‘phylodynamic approaches’ that integrate evolutionary and epidemiological data. However, they are typically computationally intensive, require extensive data, and are best applied when there is a strong molecular clock signal and substantial pathogen diversity. To determine how much transmission information can be inferred when pathogen genetic diversity is low and metadata limited, we propose an analytical approach that combines pathogen WGS data and sampling times from infected hosts. It accounts for ‘between-scale’ processes, in particular within-host pathogen evolution and between-host transmission. We applied this to a well-characterised population with an endemic Mycobacterium bovis (the causative agent of bovine/zoonotic tuberculosis, bTB) infection. Our results show that, even with such limited data and low diversity, the computation of the transmission probability between host pairs can help discriminate between likely and unlikely infection pathways and therefore help to identify potential transmission networks. However, the method can be sensitive to assumptions about within-host evolution.

(1) Here, e represents an external force of infection (which depends on the number of other infected host in the population of interest), while si represents the transition rate from status ni to ni+1. As we are interested in the progression of the infection within the host, for the purpose of this study we do not need to define e further. it denotes the number of SNPs ( 0 2 , 0 9 , 0 : …) in the mutant strains. We defined the extra SNPs in a given sampled mutant strain as divergent SNPs (as they make the mutant strain diverge ancestral strain). Although this differentiation is not particularly useful in the singleindividual framework, it becomes so as we move to consider a pair of hosts (see Figure S1.1, left). In the case of the single-host described general model, we can extend equation (1) by adding the within-host strain evolution to the disease progression as follows: where µ corresponds to the substitution rate. This system implicitly implies three assumptions: (i) there is only one strain infecting a single host, (ii) within-host substitutions happens instantaneously and the time-to-event is exponentially distributed, and (iii) the effective population size is arbitrarily small.
For simplicity we do not differentiate between mutations to different nucleotides and assume that all nucleotide transitions are equally likely (i.e. rate of A ® T is the same as rate of C ® 3 G, etc.). Assuming a host has been exposed to the infection at time t 0 and sampled at time t T , we can calculate the probability of the strain showing any given number of SNPs by numerically solving equation (2) in this time interval (from t 0 to t T ).

(b) SEI model and the pairwise KFEs
For the purpose of this study, we described the equations using a three-state Susceptible-Exposed-Infectious (SEI) model (see main text, Materials and Methods section).
Considering this model implies assuming a maximum of N = 2 infectious states in which no = S, n1 = E, n2 = I, s1 = s (i.e. transition rate from E to I), s2 = 0. The equations for a single host then become (see Figure S1.1, right panel): After defining the equations for a single host, we can extend them for a host pair. In this case, we assume that, at the beginning of the calculation, one host is infected but not yet infectious Ignoring the strain evolution, we can write the equations for a host pair as follows: where Pij is the probability of A being in status i and infectious and B being in status j (e.g.

PES correspond to Pij with A in status i = E and B in status j = S).
The underlying assumption is that one host harboured a single pathogen strain at the time of infection, but further pathogen replication resulted in the generation of pathogen diversity.
One lineage is then sampled in the 'source' host (or 'infector'), while the strain taken from the 'recipient' (or 'infectee') host comes from a different lineage and is likely to display additional diversity that distinguishes it from the first. Since eventual mutation happened in the ancestral strain prior to its transmission to the infectee did not contributed to the genetic distance between the two, this model does not consider them and it accounts only for the ones happened after the presumed infection.
By adding the strain evolution dynamics, we obtain the following system:

(c) Transmission probability calculation
The main goal is to use the KFEs to calculate the transmission probability P→R from an infector A to an infectee B. This translates into using the KFE system (5) to calculate: with t = tT -t0, which corresponds to the time span between A infection (t0) and both host sample (tT , i.e. the time of pathogen strains observation), with k and l representing the divergent SNPs of A and B strains, respectively.
At the initial time (t0) we set the probability of the infector and infectee being, respectively, at the exposed (E) and susceptible (S) stage to one: H @ + ( 8 ) = 1, while all the other system states probability is zero ( J @ + ( 8 ) = J @ H @ ( 8 ) = J @ J @( 8 ) = J 5 H @ ( 8 ) = ⋯ = 0). From this point, it is straightforward to calculate the probability for any possible combination of divergent SNPs in the two hosts strains, and for any time forward.
In the case where the sampling times differs (tA.¹ tB), we must calculate P→R in two steps. If tA < tB, we can first use the system (5) from time t0 until A was removed at tA. Then, we have to sum the probabilities calculated with the former for any potential state of B as follows: with Z representing a vector including all possible states of host A. Therefore, we can set the initial condition for a single-host KFE model like the one represented in (3), and solve this from tA to tB. An analogous reasoning holds for the case tB < tA.

S2. Simulated tree analysis (a) Transmission trees simulation
In order to validate our methodology, we applied it to a simulated dataset. We simulated 50 transmission trees using the R package epinet 5 (model homogeneous Susceptible-Exposed-Infectious-Recovered, total population: 60 individual, b = 0.00075, exposed period: 8 months, infectious period 10 months, SNPs substitution rate 0.00001, maximum number of loci 1000). We assumed that the state Removed transition corresponded to the sampling.
Once transmission trees were created, we imported them in PBuss 6 in order to generate a phylogenetic tree, with genetic distance measured in SNPs (saved in a .fasta file).
Among the 50 trees, we choose two trees with 10 or less infectious individuals for computational reasons, but showing genetic diversity (max SNPs distance ³ 3). The two trees are reported in Figure S2.1.

(b) KFE analysis
Following the methodology described in the main text Method and in S1, we analysed all transmission pairs with the Kolmogorov Forward Equations. In order to simulate parameters uncertainty, we used the same parameters space algorithm and a parameter distribution for each one (b: betaPERT(0.00025, 0.00075, 0.0015), exposed period: U(1, 24), SNPs substitution rate U(0.000001, 0.001)).
As reported in Figure S2.2 (panel A), for both trees the pairs SNP distance affects the calculated transmission probability. However, by plotting the probability against the difference in sampling time ( Figure S2.1, panel B), we could observe that for similar strain (SNP distance = 0) the probability is higher for host pairs sampled closer in time, while as we increase the SNP distance, in most case the probability had a different behaviour with some increase for hosts sampled apart.   In order to check the transmission tree reconstruction procedure (main text, section 2.2) we run 1,000 stochastic reconstruction of the simulated trees. As showed in Figure S2.5, for Tree 1 (panels A and C) we were able to find the correct configuration (seven out of seven correct transmissions). As the number of correct transmission increases, the median likelihood increases as well, until five out of seven correct transmissions were selected. Past that point, the average likelihood decreases, probably because of the lower number of configurations.
For Tree 2 the likelihood increases until there are six out seven correct transmission pairs, but out of the 1,000 stochastically reconstructed trees, we did not observe any with seven, eight

Correct transmissions
Tree likelihood D or nine correctly identified transmission pairs. This was due to the lower score of the correct transmission to infectees 2, 6, and 9 (see Figure S2.4).
Overall, we can conclude that the KFE methodology provide good results in providing hidden information by combining the genetic distance and epidemiological information. Tree reconstruction could be improved, although as the number of similar strains increases, it becomes more difficult to identity the correct transmissions.  As Figure S3.2 shows, in the case where two infected hosts M. bovis strain has no genetic difference (zero divergent SNPs in both), the probability of an intermediated infection with an unsampled individual was higher than for direct transmission when they are sampled more 12 than four years apart. An increased number of divergent SNPs (both within the source and infected sequences) corresponded to a decreased temporal threshold. In fact, in the case where both A and B sequences included three divergent SNPs (total distance six), the threeindividual transmission chain was more likely than the direct one for Dt of one year.  sequences sampled in animals in seven sub-clades (a to g) of clade 4. The sub-clades were chosen to reduce the number of pairs tested, and selected to include more than five but less than 25 hosts (see Table 1 in the main text for number of hosts in each clade or sub-clade). B: transmission probability pairs, the dot size is proportional to probability, and colour defines the transmission direction (red for badger-to-badger, green for badger-to-cattle, lightblue to cattle-to-badger, and magenta for cattle-to-cattle). C: phylogenetic tree (grey-labelled strains were excluded because were sampled from the same individual, identified by the label number, potentially at different times).