Long-term monitoring of SARS-CoV-2 seroprevalence and variants in Ethiopia provides prediction for immunity and cross-immunity

Under-reporting of COVID-19 and the limited information about circulating SARS-CoV-2 variants remain major challenges for many African countries. We analyzed SARS-CoV-2 infection dynamics in Addis Ababa and Jimma, Ethiopia, focusing on reinfection, immunity, and vaccination effects. We conducted an antibody serology study spanning August 2020 to July 2022 with five rounds of data collection across a population of 4723, sequenced PCR-test positive samples, used available test positivity rates, and constructed two mathematical models integrating this data. A multivariant model explores variant dynamics identifying wildtype, alpha, delta, and omicron BA.4/5 as key variants in the study population, and cross-immunity between variants, revealing risk reductions between 24% and 69%. An antibody-level model predicts slow decay leading to sustained high antibody levels. Retrospectively, increased early vaccination might have substantially reduced infections during the delta and omicron waves in the considered group of individuals, though further vaccination now seems less impactful.

Supplementary Figure 1.Study flow and point prevalence for SARS CoV-2 seropositivity in healthcare workers and community members recruited in Jimma and Addis Ababa including long-term follow-up (LTFU) numbers and percentages.

Clustering of Antibody Data
The distribution for the Anti-S antibody levels in Figure SN1 has three distinct peaks: one peak close to zero, one peak at 2 and one peak at 3.5.Comparing to the distributions with reactivity of Anti-N and with the vaccination information one can derive that the first of those two peaks corresponds to one infection or vaccination and the second to two or more infections or vaccinations.As we use a compartment model for the subsequent analysis, we decided to categorize the antibody measurements based on the observed groups.Therefore, we combined all community measurements, respectively healthcare worker measurements, from different sites and rounds.First, we individually processed N and S measurements, excluding NaN values and measurements below the detection threshold.We utilized scikit-learn's k-means clustering implementation to categorize the remaining data points above the threshold into two distinct groups 1 .We chose clustering the antibody datasets separately, i.e. 1-dimensional clustering, motivated by the bi-modal distributions we observed in the histograms for Anti-S.Moreover, the separate clustering of the Anti-N or Anti-S data provides: (i) a slightly higher statistical power, since for some study participants only one the antibody tests, Anti-N or Anti-S, was successful; and (ii) clear cutoff values for aggregated Anti-S measurements (e.g. by using midpoint of the two resulting groups' centers), which is necessary for the multivariant model.The performance of the k-means clustering can be observed in Figure SN2.To visualize the data, the three groups (below the detection limit, above the limit but below the category separation, and above the category separation) were aggregated for each round.We employed the monotonic cubic spline interpolation of the scipy 2 package to interpolate the resulting values (Figure 1

of the main manuscript).
There was no vaccine publicly available in Ethiopia until after Round 3 3 .Because of this information about general vaccine availability in combination with our previous observation that vaccinated individuals are more likely to answer questions on the vaccination status on the questionnaire than unvaccinated individuals, we considered individuals without an answer ("N/A") as "unvaccinated" for modelling.This is also supported by official nation wide numbers of people with at least one dose of vaccine, provided by Our World in Data (ourworldindata.org)and depicted in Figure SN3.Moreover, we treat the effect of vaccine and infection on Anti-S levels analogously.This is based on the comparison of the observed antibody levels for healthcare workers (Supplementary Figure 1) and community members (Figure 1).There from Round 3 to Round 4 for healthcare workers a clear shift from medium Anti-S to high Anti-S is observed in response to vaccination, but community members reach the same levels by infections alone.

Round 1
Round

Sequencing Result of Variant Data
For the sequencing data, we merged the data sets from Addis Ababa and Jimma sites and removed entries, where sequencing failed.The observed Pango 4 lineages were assessed for Mutations of Interest or Concern (MOIC) by referencing the outbreak.info 5,6 atabase.Based on these mutations, the lineages were grouped and groups lacking sufficient statistical power, i.e. sample size below 3, were dropped from the data set.The complete list of observed lineages and their mutations is provided in Table SN1.The samples were then aggregated by the month of collection and interpolated using scipy's monotonic cubic spline interpolation for visualization purposes (Figure 2a of main part).

Supplementary Note 2: Multivariant Model
The multivariant model is encoded in the SBML 7 format, integrated with the parameter estimation problem in the PEtab 8 format and made available at Zenodo 9 .In the following, we provide a compact mathematical description, while for additional details we refer to the SBML file and the code.

Model Equations
We utilize the SEIR (susceptible, exposed, infectious, and recovered) framework as basis for our model structure.
Assuming a maximum number of 4 infections all combinations of our 8 variants would lead to a system of 8 4 = 4096 pathways.Hence in order to obtain a computationally feasible system while still retaining realism we exclude pathways which deviated from the chronological order of variant appearances worldwide.We define by P i the set of potential reinfections after infection with variant i, described by Table SN2 where vaccination is treated as previous infection with the wildtype variant.Furthermore to account for the reported inter-infection intervals we assume third infections before omicron played a negligible role and allow a fourth infection only for omicron BA.4/5, i.e.P i collapses to {7, 8} resp.{8}.For i = 1, . . ., 8 representing the variant index, where these numbers correspond to columns in Table SN2, we have the following equations for first infection or vaccination where Îi is the sum of all currently infected with variant i, N the sum of all state variables, t 0i the entrance date of variant i and v k denote the k-th vaccination rates.The second infections and vaccinations for i = 1, . . ., 8, v (numbers for infections, v for vaccination) and j = 1, . . ., 8 are described by where n(Idx) := #{v ∈ Idx}.
And finally for the fourth infection we have the equations for i, j = 1, . . ., 8, v and k = 7, 8, v The effective infection rates β Idx are split into three parts the seasonality factor s seas , the reinfection factor s reinf and the transmission rate βIdx[−1] of the currently encountered variant Idx[−1], i.e. variant corresponding to last index entry of Idx.
The seasonality is formulated as follows where s frac denotes the fraction of seasonality effect, i.e. it equals 1 if transmission rates are fully governed by as yearly cycle and it equals 0 if there is no seasonal effect.The sinus function introduces the periodicity which is 13/30 scaled to have a period of one year.Its peak is shifted by the parameter seas shift and the exponential function ensures positivity.
The reinfection factor depends on the previously encountered variants encoded in all but the last index entries Idx[: −1] and the currently encountered variant encoded in the last index entry Idx[−1] and is formulated as follows Here d(x, y) is the Hamming distance between MOIC observed in variant y and MOIC observed in variant or combination of variants x.The case where the previous infection(s) x is only one variant, not multiple ones, is depicted in main paper's Figure 2e.The parameters s 0 and s encode the risk reduction for being encountered with the same variant as previously and the risk reduction for an infection with a variant with mutation distance 1 to the former infection's variant, respectively.
In order to incorporate prior knowledge about the variants transmission rates βi , which is often provided relative between different variants, they are defined as multiplicatives of a base transmission rate β b or of other β j as depicted in Table SN3.
Table SN3.Definition of transmission rates for different variants.

Data Integration
The initial time of our model t = 0 is set to be the 13 th of March 2020 as this was stated by national test positivity data as first Covid-19 case in Ethiopia.
In order to map the model to our data we define three types of observables: Anti-S antibody prevalence, variant distribution and national incidence numbers.Antibody prevalence is observed as levels of 1 infection or vaccination and 2 or more infections or vaccinations and hence, its observable functions are defined by The variant observables are defined for i = 1, . . ., 8 as Finally the national test positivity rate is mapped to model simulations by Spline fit to cummulative vaccination numbers where s tpr will be estimated.
Measurement errors are assumed to be normally distributed for each time point and observable with standard deviations taken from the multinomial error estimation described below.
The three vaccination rates for first, second and third vaccination v 1 , v 2 and v 3 are fit previously to the parameter estimation as part of the modeling, by fitting monotonic cubic splines to the antibody cohorts' vaccination information and incorporating those splines directly as time dependent functions into the model.The result of those fittings can be seen in Figure SN5.
For improved time resolution of the antibody data in the estimation process while remaining reasonable errors we split each round into two subrounds by performing k-means clustering on its sampling dates (Figure SN6).The antibody prevalence levels were clustered as described in Supplementary Note 3 and then aggregated within the subrounds.
Error estimates for antibody and variant data were obtained by fitting multinomial models for each data-type timepoint combination using pymc3 10 .Error estimates for the national test positivity rate were obtained by fitting binomial models using pymc3.The sample sizes used for these estimations are listed in Table SN4.

Parameter Estimation
There are a total of 24 model and observation parameters subject to estimation.They are listed in Table SN5 including prior information, boundaries, the maximum a-posteriori used as starting point of sampling (obtained by gradient based optimisation) and their sampling result.
For the base transmission rate we use as priors the Bayesian estimation results of the SEIR model of our previous study and priors for incubation and recovery times are taken from literature as established in before 11 .Also prior information about the increased transmission rates of variants are taken from literature, where available.

Model Analysis
The model estimates the entry time points of most variants substantially later than the first global appearance according to outbreak.info(Figure SN9).Global reporting and local estimation coincide only for wildtype*, alpha and beta, the latter of which does not play a large role in the overall dynamics observed and estimated by us.
Including the top ten infection-vaccination pathways depicted in main paper's Figure 4 the model estimated 68 pathways contributing more than 0.1 % (Table SN6).We calculated them by checking sizes of all, and in particular the recovered compartments, after simulating the model until t = 1200, where we encountered equilibrium due to the lack of new variants after omicron.Then we investigated the transitions inside the pathways (Figure 4c of main manuscript) by appropriately scaling and stack-plotting the time courses of all recovery states being part of the pathway.For example for R 1,2,3,4 this we would plot R 1 , R 1,2 , R 1,2,3 and R 1,2,3,4 .

Data Integration
Initial time of the model t = 0 is set to be the 13 th of March 2020 as for the multivariant model.
The observables mapping the antibody-level model to data are Ãi j = A i j + E i j + I i j N and the national test positivity data is mapped with a scaling as before for the multivariant model.
Measurement errors are assumed to be normally distributed and obtained by multinomial error modelling as described above.The sample sizes used for the error estimates of the nine antibody categories are listed in Table SN7.
The antibody rounds are split into subgroups by sampling dates and clustered into categories as before.Moreover errors of all data types for estimation are again obtained by multinomial, resp.binomial, models.
The vaccination rate v is implemented as a piecewise linear function which is calculated by monthly averaging the vaccination information of the antibody sampling cohort a priori to the parameter estimation.The results of this can be seen in Figure SN10.Note in the equations of the model we made the assumptions that people with already high Anti-S levels do not get vaccinated anymore, i.e. the amount of people still applying for vaccination after two infections or vaccinations is negligible.
For the antibody-level model the variant data is directly incorporated as a time-dependent function.First, the variant data is aggregated into 2-month bins, and Gaussian kernels are fit to the distributions using scipy's "minimize"  function.Finally, the distributions are normalized so that they sum up to 1.The result of this fitting process is illustrated in Figure SN11.

Parameter Estimation
There are a total of 20 model and observation parameters subject to estimation.They are listed in Table SN8 including prior information, boundaries, the maximum a-posteriori used as starting point of sampling (obtained by gradient based optimisation) and their sampling result.
The parameter sampling for the multivariant model was performed with a sample size of 1e5.The first 7e4 samples were removed as burn-in.The remaining sample passed the Geweke convergence test as well as visual examination (Figure SN12).Parameter correlations and distributions are depicted in Figure SN13.
Ro-N-Ig and Ro-RBD-Ig-quant measurements of five rounds of convenience sampled healthcare workers.a-e Scatterplots displaying the relationship between levels of N-and S-specific antibodies (y-axis, resp.x-axis) across five rounds of measurement.Known vaccination status of each participant indicated by colors, cutoff levels indicated by dashed lines and percentages of people per category annotated in red.f-g Evolution of antibody levels over time between Fall of 2020 and April 2022.Times of sample acquisition are highlighted as circles.Source data are provided as a Source Data file.
-S antibody levels (Healthcare workers)

Figure SN1 .
Figure SN1.Distributions of Anti-S antibody levels for healthcare workers and community members.Reactivity of Anti-N and vaccination status highlighted by colors (first resp.second row for each group).Source data are provided as a Source Data file.

Figure SN2 .
Figure SN2.Distribution of positive antibody measurements for community members and healthcare workers (HCW).Thresholds computed with k-means clustering are highlighted.Source data are provided as a Source Data file.

Figure SN5 .
Figure SN5.Spline fits to cumulative counts of first, second and third vaccination information obtained from the antibody study's participants.Source data are provided as a Source Data file.

Figure SN6 .
Figure SN6.Subgroups of antibody sampling rounds obtained by k-means clustering of sampling dates for community members and healthcare workers.Source data are provided as a Source Data file.

Figure SN7 .
Figure SN7.Multivariant model's sampled log-posterior and parameter traces.Burn in phase is cut off.Source data are provided as a Source Data file.

Figure SN8 .
Figure SN8.Scatter plots and distributions of multivariant model's sampled parameters.Source data are provided as a Source Data file.

Figure SN10 .
Figure SN10.Monthly averaged vaccination rates and cumulative vaccinations of antibody study's cohort.Source data are provided as a Source Data file.

Figure SN11 .
Figure SN11.Fits of normalized Gauss kernels to mean variant data.Variant data depicted as mean -/+ standard deviations.Sample sizes listed in TableSN4(b).Source data are provided as a Source Data file.

Table Supplementary Table 1 .
Demographic characteristics of healthcare workers study participants.Age denoted as median and 90% quantiles.Round 1-3 (R1-R3) are the previous study of Gudina et al 2021.
continued on next page 8/30

Table SN2 .
Boolean table of possible reinfections, where 1 means reinfection in model possible and 0 means no reinfection allowed.Rows represent variants of previous infection and columns the variants of reinfection.

Table SN4 .
Sample sizes of aggregated measurements used for fitting the multivariant model.Listed per corresponding observable and total sample sizes per time point.Time depicted in days since 20 th March 2020.

Table SN5 .
Parameters of the multivariant model.Estimated entry times of variants.First global appearance and earliest date in sequenced data set included for comparison.Source data are provided as a Source Data file.

Table SN6 .
Pathways of the multivariant model which account for more than 0.1%

Table SN6
Allowing all pathways between variants.(iii)No grouping of variants.In the end all of these formulations proved impractial.For (i) we would have to model individual parameters for each combination of past infections and new infections.Even with the other simplifications of the model still in place this leads to a total of 205 immune escape factors instead of the two we have in the current model.For such a high dimensional parameter estimation the dataset would have been insufficient to inform.(ii) would result in a model with 12289 different states being computationally infeasible.Extension (iii) implies 50 different variants instead of the current 8 lineages.Even if we disregard the low statistical power we have for some of these single sublineages, we would still end up with more than 10000 different model states and five times as many parameters

Table SN7 .
Sample sizes for aggregated 2-dimensional antibody measurements corresponding to the observables Ãi j and total sample sizes per time point.Time depicted in days since 20 th March 2020.