Integrating epidemiological and genetic data with different sampling intensities into a dynamic model of respiratory syncytial virus transmission

Respiratory syncytial virus (RSV) is responsible for a significant burden of severe acute lower respiratory tract illness in children under 5 years old; particularly infants. Prior to rolling out any vaccination program, identification of the source of infant infections could further guide vaccination strategies. We extended a dynamic model calibrated at the individual host level initially fit to social-temporal data on shedding patterns to include whole genome sequencing data available at a lower sampling intensity. The study population was 493 individuals (55 aged < 1 year) distributed across 47 households, observed through one RSV season in coastal Kenya. We found that 58/97 (60%) of RSV-A and 65/125 (52%) of RSV-B cases arose from infection probably occurring within the household. Nineteen (45%) infant infections appeared to be the result of infection by other household members, of which 13 (68%) were a result of transmission from a household co-occupant aged between 2 and 13 years. The applicability of genomic data in studies of transmission dynamics is highly context specific; influenced by the question, data collection protocols and pathogen under investigation. The results further highlight the importance of pre-school and school-aged children in RSV transmission, particularly the role they play in directly infecting the household infant. These age groups are a potential RSV vaccination target group.


A1. Model modification to fit pathogen data identified at group resolution
The model for fitting to group-level data is similar in structure to the model of sequence data presented in the main text, however, there is no identification of the infecting pathogen at the cluster level, only at the group level. The rate of exposure to a particular RSV cluster g acting on a susceptible person i from household h at time t: ! The background function / ( ) is derived the same way 0 ( ) is, as described in the main text.
Following from the rate of exposure is the probability of exposure give by: The probability of onset is given as:   while the y-axis shows the log parameter value.
To confirm convergence observed in the trace plots, we calculated the Gelman-Rubin-Brooks statistic and the effective sample size. When using the GRB statistic, convergence is said to have occurred if the ratio of pooled/within chain variance is close to 1. The GRB the GRB statistic as the number of iterations increases for each parameter. This is to check whether a value close to one was reached by chance or if the trend line had truly stabilized close to 1.    The mGRB is 1.09 and the mESS is 4146.
As a rule of thumb, a GRB of <1.1 is generally considered good, as such, it is safe to conclude that there was convergence.

A3. Model validation
To validate the model, we simulated multiple epidemics and checked to see if the observed epidemic was captured by the range of simulated dynamics. In addition to comparing the time course of cases, we also looked at the total number of cases in an epidemic, the proportion of individuals with multiple onsets and the number of cases in the first and last week of the time period. These values from the data were compared to the range of simulated values to check that key aspects of the epidemic were being reproduced by the simulations.
The results of the model fitting are the posterior parameter distribution and corresponding augmented data for the cluster ids of cases with no genetic information. A simulation based on a set of parameter values will also be based on the corresponding augmented data which will be used to derive a complete set of shedding profiles from the observed data. A single shedding profile is a combination of duration of shedding, viral loads and symptom status, and genetic cluster. The simulation pseudo code per simulation is as follows: Where !,#,1 ( )= rate at which person i is exposed to infection of cluster type C.
Determine who experiences each cluster specific transmission event. For a given event, order individuals capable of experiencing the event. For a given person p to experience the event, the following inequality has to be satisfied.
Where: Shedding profiles are derived from the observed data and a combination of duration of shedding, viral loads and symptom status, and genetic cluster.
The shedding profiles are grouped by age in the following 4 groups <1,1-5, 5-15 and ≥15 years. Once latency durations and shedding profiles have been assigned, the state variables for each individual are updated accordingly.
To explore how much variation there can be in the simulations from a single parameter set, a set of 12 parameter set samples were used, and for each set, 100 simulations were run, giving a total of 1200 simulations. We then sampled 100 parameter sets and run single simulations from each to explore between-parameter-set variation. The results of the simulations are presented in the form of epidemic curves and summary measures that are used to compare the main features of the outbreak.   Each panel shows the distribution of the total numbers infected in the simulations run using 12 different parameter sets (violin plots) compared to the total number from the observed data (dashed red line). The y-axis shows the total number and the x-axis is labeled by parameter set used. Top row: RSV A results for all the cases (1 st column), cases < 1 year old (2 nd column), cases between 1-5 years old (3 rd column) and cases > 5 years old (4 th column).
Bottom row: RSV B results. Violin plots are a combination of box plots and density distributions, the shapes should therefore be interpreted as density plots would while the ranges should be interpreted as the tips of whiskers in a box and whisker plots.

A4. Further details of the transmission model
The community rate of exposure Though there are cases from different households in the data, the sample in the study is small relative to the number of households in the community, as such the true number of infectious community contacts is unknown. We therefore split the community rate of exposure into two components: exposure from sampled neighbours, _ ℎ _ #,1,)→! ( ), and exposure from unknown sources represented by a time varying function fc(t). We assumed that the rate of exposure from a sampled neighbour is dependent on the spatial distance between individuals.      For i with no onset of type c: Where T is the end of the observation period.
For i with an onset of type c, the likelihood is given as: In this instance, to factor in the genetic data we modify the rate of exposure given in Eq 1 in the main text such that:

A5. Data pre-processing
Imputing complete shedding, ARI and presence durations.
Given the 3-4 day sampling intervals, complete shedding and ARI durations had to be imputed, and missing viral loads linearly interpolated. For the model, we will assume that all the cases were observed, and ignore the possibility of short duration shedding episodes that could have been missed by the sampling intervals. During the sample-collection visits, if a household member was not present, they were recorded as being 'away' on that particular day. As with the shedding information, there was incomplete information on continuous periods of presence or absence from the household which was also imputed.
An RSV A/B shedding episode is defined as a period within which an individual provided PCR positive samples for RSV A/B that were no more than 14 days apart. Using the mid-point method, shedding was assumed to start mid-way between the last negative sample and the first positive sample, and it ended midway between the last positive sample and the first negative sample of an episode. This is illustrated below: Grey circles are positive samples in a single episode, empty circle are negative. t1, t2, t3 and t4 are dates of sample collection.
For (t4-t3) and (t2-t1) ≤7 days Where x = mean of sampling intervals for samples in an episode, which was found to be 3.45 days.
Any negative samples (Ct >35 or Ct=0) in between a shedding episode were ignored, i.e.
were not treated like true end of shedding Imputed Shedding duration Imputed ARI duration Time In this case, the mean sampling interval for ARI 'samples' within an episode was 3.78 days.
This was obtained from all ARI episodes not just the ones within shedding episodes The imputation of continuous periods of presence or absence from the household was done similar to the imputation of shedding durations, however, there was no left or right censoring. Each participant had a set of days of recorded data, these days were either marked as 'away' or 'present' in the household, e.g. a participant might have data on days {32, 36, 39, 43, 46, 50, 53, 57} with status {away, away, present, away, present, present, away, present}. Since no data is available for this individual before day 32 and after day 57, no imputation is done outside this time window. For the days within the window, imputation is done as illustrated below:

Filled circles are days when the participant was recorded as being present while open ones is
when they were away. The present period starts halfway between the last 'away' and first 'present' and ends halfway between the last 'present' and first 'away'.
In order to include information of the amount of virus shed by an infected person into the transmission model, the Ct value need to be converted to log10 RNA copy number which is a more direct measure of viral load. The formula used to convert Ct values to their log10 RNA equivalent was y= -3.308x + 42.9, where y=Ct values and x=log10 RNA copy number 3,4 . Following conversion of the PCR Ct values to viral load, we proceeded to interpolate the viral loads for days in an episode that did not have data. Linear interpolation was used for all the shedding episodes. It was assumed that the starting and ending sample, if data was missing, had a viral load of 2.388 log10 RNA (baseline positive Ct value converted to viral load). For two samples of viral load Va and Vb at times ta and tb, tb > ta, the gap in between is filled out as follows: For tbta =n, viral load Vj at time point tj for j=1…(n-1) is given by 6 Viral loads lower than 2.388 log10 RNA in between an episode were not included in the interpolation

Imputing missing genetic information
The WGS data was used to rule out transmission events where it was assumed that cases in different genetic clusters are not part of the same transmission cluster. It was also assumed that for cases within the same genetic cluster, the likelihood of a transmission event is weighed according to pairwise genetic distance dgen(i,j). As mentioned in the main text, genetic clusters were established based on a combination of criteria: nucleotide distance cut-off, clustering patterns on the global RSV phylogeny and the inferred date of sequence divergence. As a result of incomplete sequencing of all the positive samples, there are gaps in the genetic data. To fill these in we classified missingness into 3 categories and exploited elements of the study design to fill in the gaps.
• Missing level 1: non-sequenced samples part of an infection episode with ³1 other sequenced sample. The entire episode was assigned the cluster id of the sequenced sample(s), where there was more than 1 id, the episode was divided accordingly.
• Missing level 2: none of the samples in an episode were sequenced, but the episode is part of a social-temporal cluster with some genetic information. The entire episode was assigned the cluster id of the social-temporal cluster. This assumes that if an episode has a temporal overlap with other cases in the same household, they are likely part of the same infection cluster (household outbreak).
• Missing level 3: none of the samples in an episode were sequenced and there is no genetic information in the social-temporal cluster. The cluster id for the entire episode was treated as augmented data and inferred along with the model parameters.
Within a given RSV group, infection by a particular cluster is assumed to be a mutually exclusive process, an individual can only shed one cluster type at a time. The genetic data available is consensus whole genome sequences as such, only one cluster can be identified from a single sample.
Consider a case i who had an onset after case j, both of whom have sequences. The genetic distance between case i and j is obtained by comparing the first sequence available from case i and any sequence from j whose sampling time is closest to the first sequence from i.
In the illustration below, this would mean comparing sequence Si,1 to Sj,2 to obtain genetic distance dgen(i,j). The phylogenetic analysis of Agoti et al 5 found that long shedding episodes do not have drastically differing genetic sequences (<6 SNPs) as such it should not make a significant difference whether we compare sequences forward (Si,1 to Sj,2) or backward (Si,1 to Sj,1) in time.
If either one or both of the cases do not have sequences, then dgen(i,j) is randomly selected from the set of all pair-wise genetic distances from the specific genetic cluster. For cases with sequence data, dgen(i,j) is fixed, but for cases where one or both is missing sequences, dgen(i,j) changes every time the likelihood is calculated to reflect uncertainty. In this way only pairs of cases with sequence data contribute definitive genetic information to the parameter inference algorithm while the rest will not. We use nucleotide differences as the distance dgen(i,j). Once we have dgen(i,j), we then use this to obtain a genetic weight for the probability of a transmission event given by )→! = 9: !"# (!,)) * > where is the rate of exponential decay and is estimated along with other model parameters. This function form results in a negative exponential relationship between the genetic weight and the genetic distance between a pair of cases. and the augmented data DA given the observed data D. We assume that all the cases were observed but that for some of the cases, there is no information on the cluster id of the shedding episode, as such, the augmented data is the set of all shedding episodes whose cluster id was left unassigned by the imputation process previously described. These include cases that are part of household outbreaks with no genetic information and cases that are part of household outbreaks with more than one possible genetic cluster id. For cases that are part of an outbreak with no genetic information, a single cluster id is inferred for all the cases in the household outbreak.
Bayesian inference results in an updated distribution of the parameter of interest (posterior distribution) given prior assumptions/knowledge of the parameter (prior distribution) and an expression giving the probability of a parameter value given data (likelihood) i.e.
( | , F ) ∝ ( ) × ( | , F ). Where there is no exact expression for the posterior distribution, numerical methods are used to find an approximation of the target distribution, adaptive MH-MCMC is a popular first step.
We specified the target distribution as ( | , F ) = ( | F ) ( | , F ) ( ); ( | F ) = probability of the observed data given the augmented data; ( | , F ) = the likelihood of the parameters given the observed and augmented data; ( ) = the prior probability of the parameters. The augmented and observed data are independent and we have no information to inform what the missing cluster ids could be, making every combination of D and F equally likely. Consequently, we did not include P(D|A) when calculating the posterior probability. We used weakly informative priors in the form of a normal distribution with mean 0 and a standard deviation of ~3 for the log of parameters. We initiated 3 chains and set the algorithm to start adapting the proposal distribution based on accepted parameters after 10000, 15000 and 10000 iterations respectively. Burn-in was assessed visually after which the results of the three concurrent chains were combined to infer the posterior distribution. The three chains were run for 250,000 iterations each.

Choice of proposal distributions for the parameters
For the parameter set we used a multivariate normal distribution as the proposal distribution. For iteration n in the chain a new set * will be proposed such that * ~( 69P |Σ). The choice of the variance-covariance matrix Σ will determine the size of the space that is explored and how fast the MCMC chain converges. After a certain number of iterations, Σ was modified to ensure proper mixing. The modification was automated through an adaptive random walk MH-MCMC algorithm. There are several adaptation algorithms 6 , we chose one that learns from the empirical distribution of values up to the (n-1) th iteration to modify the Σ at iteration n. For samples { P , Q , R , . . . 69P } in the MCMC chain so far, at iteration n the proposal density g(.) is given by Σ 69P = The empirical variance-covariance matrix derived from samples { P , Q , R , . . . 69P } = The dimension of the parameter set Σ C = The initial guess of the parameter variance-covariance matrix. This is usually a diagonal matrix of variances.
This notation means for a fraction of the time (1 − ), the proposal distribution will be ( 69P |2.38 Q Σ 69P / )) and the rest of the time it will be ( 69P |0.1 Q Σ C / ). Prior to adaptation beginning at iteration n, the proposal distribution at iteration k is given by

Pseudo algorithm for our implementation of MH-MCMC
For each MCMC chain 1. Set initial values for the parameters and assign cluster ids at random for the outbreaks with no sequence information (uninformed outbreaks). iii. Withas the proposed cluster id, the proposed change to the augmented data is accepted with probability ′( F 69P , F * ) = min }1, ( 6 | , F * ) Where | 3 | is the number of household outbreaks in 3 in the present permutation of the augmented data F 69P and | -| is the number of household outbreaks in -.

A7. Further details of the HPTS
Establishing the highest probability transmission source (HPTS) Per case, we identified the transmission source that had the highest likelihood given the data and a parameter set * sampled from the joint parameter posterior distribution For a given source of infection Ω ! in the same household as i, the rate of exposure is given by: × 5,!67 @ g $ ,#,1 ( )I × g $ ,# ( )‹ For Ω ! not in the same household as i but among the sampled individuals, the rate of exposure is given by: For Ω ! an unknown source external to the household, the rate of exposure is given by: