Emulator-based Bayesian optimization for efficient multi-objective calibration of an individual-based model of malaria

Individual-based models have become important tools in the global battle against infectious diseases, yet model complexity can make calibration to biological and epidemiological data challenging. We propose using a Bayesian optimization framework employing Gaussian process or machine learning emulator functions to calibrate a complex malaria transmission simulator. We demonstrate our approach by optimizing over a high-dimensional parameter space with respect to a portfolio of multiple fitting objectives built from datasets capturing the natural history of malaria transmission and disease progression. Our approach quickly outperforms previous calibrations, yielding an improved final goodness of fit. Per-objective parameter importance and sensitivity diagnostics provided by our approach offer epidemiological insights and enhance trust in predictions through greater interpretability.


Attenuation of inoculations 5
Pathogenesis models Indirect pathogenesis and comorbidity models

Infection of the human host
The seasonal pattern of entomological inoculation rate (EIR) determines seasonal pattern of transmission and thus the parasite densities in the individual, modified by natural or acquired immunity and interventions 2 .
1.2.1 Differential feeding by mosquitoes depending on body surface area In the base model, the expected number of entomological inoculations experienced by individual ! of age " at time # is where $ !"# (#) refers to the annual entomological inoculation rate (EIR) computed from human bait collections on adults and '(), is the individual's availability to mosquitoes, assumed to be proportional to average body surface area, depending only on age . '("(!, #)* increases with age up to age 20 years where it reaches a value of ' !"# (the average body surface of people ≥ 20 years old in the same population).
The biting rate in relation to human weight is based on data from The Gambia published by Port and others 12 , where the proportion of mosquitoes that had fed on a host were analyzed in relation to the host's contribution to the total biomass and surface area of people sleeping in one mosquito net 5 .

Control of pre-erythrocytic stages
The number of infective bites received per unit time for each individual !, adjusted by age, is given by Eq. 1 above. A survival function /(!, #) defines the probability that the progeny of an inoculation survives to give rise to a patent blood stage infection, i.e., the proportion of inoculations that result in infections or the susceptibility of individual ! at time #. The force of infection is modelled as where $ " (!, #) is the expected number of entomological inoculations endured by individual ! at time #, adjusted for age and individual factors, and the number of infections ℎ(!, #) acquired by individual ! in five-day time step #, follows a Poisson distribution: The susceptibility of individual ! at time #, /(!, #) is defined as: where / &!! , ; ' * , $ * , @ ' and / $ are constants representing the lower limit of success probability of inoculations in immune individuals, critical value of cumulative number of entomologic inoculations, critical value of $ " (!, #), steepness of relationship between success of inoculation and ; ' (!, #), and, the lower limit of success probability of inoculations at high where $ " (!, #), respectively. Here ℎ(!, #)~ CD!EEDF(A(!, #)*.
/ $ and $ * are fixed to / $ = 0.049, and $ * = 0.032 inoculations/person-night and are detailed in 5 . where P / (I, I !"# ) is taken from a statistical description of parasite densities in malariatherapy patients and J(!) describes between-host variation with a log-normal distribution with variance Q & 0 .
s A (9) (#) and $ !"# (9) (# + Z 8 ) are the corresponding values for the intervention scenario. Then where Z 8 corresponds to the duration of the sporogenic cycle in the vector, which we approximate with two time steps (10 days).
is the total vectorial capacity)

Morbidity
In order to simulate the clinical state of individual ! at time #, for each five-day time step 5 independent samples from the simulated parasite density distribution are drawn for each concurrent infection N.

Acute morbidity (uncomplicated clinical cases)
The model for an episode of acute morbidity was originally described in 10 and occurs in individual ! at time # with probability where R * is the pyrogenic threshold and R !"# is the maximum density of five daily densities sampled during the five-day interval #.
The pyrogenic threshold changes over time following where t 9 (R(!, #)* is a function describing the relationship between accrual of tolerance and the parasite density R(!, P); t 0 (R * (!, #)) describes the saturation of this accrual process at high values of R * and u vR * (!, #) determines the decay threshold with first-order kinetics, ensuring that the parasite tolerance is short-lived 10 .
Here t 9 (R(!, #)* is defined to ensure that the stimulus is not directly proportional to R but rather that it asymptotically reaches a maximum at high values of R: At high values of R * , a higher parasite load is required to achieve the same increase: Thus, the pyrogenic threshold R * is defined to follow and the initial condition R * (!, 0) = R 3 * at the birth of the host, where ^, u v R 3 , R 9 * and R 0 * are targets of the calibration, and are defined in Supplementary Table 1.

Severe disease
The model for severe disease was described in Ross et al. 2006 11 and two different classes of severe episodes are considered by the model, x 9 and x 0 . C F9 (!, #) is the probability that an acute episode (') is of class x 9 and where R F9 * is a constant to be calibrated and y(!, #) is the clinical status of individual ! at time #.
Class x 0 of severe malaria episodes occurs when an otherwise uncomplicated episode coincides with some other insult, which occurs with risk where z 3 is the limiting value of z("(!, #)* at birth and " G * is the age at which it is halved, and both are to be calibrated.
The probability that individual ! experiences an episode belonging to class x 0 at time #, conditional on there being a clinical episode at that time is The age ant time specific risk of severe malaria morbidity conditional on a clinical episode is then given by

Mortality
Malaria deaths in hospital are a random sample of admitted severe malaria cases, with age-dependent sampling fraction { 2 ("), the hospital case fatality rate, derived from the data of Reyburn et al. The severe malaria case fatality in the community for age group ", { H (") is estimated as where | 9 the estimated odds ratio for death in the community compared to death in in-patients is an age-independent constant to be calibrated and { 2 (") is the hospital case fatality rate. The total malaria mortality is the sum of the hospital and community malaria deaths.
The risk of neonatal mortality attributable to malaria (death in class ] 9 ) in first pregnancies is set equal where ~I / is related to ~J / , the prevalence in simulated individuals 20-24 ears of age via and ~J / * and ~I / * are constants to be calibrated and are detailed in Supplementary Table 1.
~I / = 1 − 1 1 + :~J / J/ * < (34) An indirect death in class ] 0 is provoked at time #, conditional on there being a clinical episode at that time with probability C K0 (!, #) where and where { K is limiting value of C K / (!, #) at birth and " G * is a constant to be calibrated. Deaths in class ] 0 occur 30 days (six time steps) after the provoking episode.

SUPPLEMENTARY NOTE 2: CALIBRATION APPROACH AND DATA SUMMARY
A comprehensive epidemiological calibration dataset was collated to parameterize OpenMalaria. This calibration dataset covers a total of eleven different epidemiological relationships (or objectives for fitting) that span important aspects of the natural history of malaria. Data were collated from different settings (see Supplementary Table 2 for summary) and were detailed in the original model descriptions 2,10 and a later parameterization 1 . A total of 61 simulation scenarios were setup to parameterize OpenMalaria, constructed to simulate the study surveys and study sites that yielded the calibration dataset. The study site observations were replicated in OpenMalaria by reproducing the timing of the surveys and their endpoints (such as prevalence and incidence) and matching simulation options to the setting with regards to transmission intensity and seasonality, vector species, treatment seeking behavior and anti-malarial interventions. The objectives and data are further detailed below.
The parameter estimation process is a multi-objective optimization problem with each of the epidemiological quantities in Supplementary Table 2 representing one objective. The aim of the optimization is to find a parameter set that maximizes the goodness of fit by minimizing a loss statistic computed as the weighted sum of the loss functions for each objective. Building a weighted average reduces the multiple loss terms to a single overall loss statistic, defined as: where ! !" (#) is the loss function for parameter vector #, epidemiological quantity & and dataset ', and the weights ( ! were chosen so that different epidemiological quantities contribute approximately equally to )(#). an unknown minimum. We did not update these loss-functions in order to compare to our previous approaches.
The likelihood functions are given by where the observed values are * # , … , * $ and the model parameters #. In practice, it is easier to work with the log likelihood, namely The loss functions ! ! (#) used for each objective are detailed in the following sections.

Objectives: Epidemiological data and loss functions
Below we described each fitting objective in terms of the data (setting, surveys, observations, references) along with the associated loss function and original references. Supplementary Table 3 provides an overview of the 61 simulation scenarios used for calibration, and which objective they contribute to. For space reasons, Supplementary

.1 Data
The data used for the calibration of objective 2 (age-patterns of prevalence) consists of six crosssectional malariology surveys conducted in the Rafin Marke, Matsari, Sugungum villages in Nigeria 1970-1972 (12 age groups each, part of the Garki Project during the pre-intervention period) 14 , Navrongo in Ghana 2000 (12 age groups) 15 and Namawala 1990-1991 21 and Idete in Tanzania (11 and 6 age groups, respectively) 1992-1993 17 . In all study sites, annual transmission intensity (EIR) and seasonal patterns were assessed using light trap or human night bait collections and dissections of the salivary glands (see Fig. 2

Loss function: Binomial Log Likelihood
We denote the binomial log likelihood for each scenario of this objective to be where ! is the number of age groups, " the number of surveys, ) !,# the scenario data number of parasite positive hosts and % !,# the scenario data number of hosts for age group k and survey j. Parameter # !,# is associated with the model predictions and is given by where ) $,# * are the predicted number of parasite positive hosts and % $,# * the predicted number of hosts for age group & and survey '.

Data
The same data sources as for objective 2 (age pattern of prevalence) were used for calibration of objective 3 (age pattern of parasite density). Parasite densities in sites that were part of the Garki project (Sugungum, Rafin-Make and Matsari, Nigeria) were recorded by scanning a predetermined number of microscope fields on the thick blood film and recording how many had one or more asexual parasites visible. These were converted to numbers of parasites visible by assuming Poisson distribution for the number of parasites per field and a blood volume of 0.5 mm 3 per 200 fields. In the other studies (Idete and Namawala, Tanzania and Navrongo, Ghana), parasites were counted against leukocytes and converted to nominal parasites/microliter assuming the usual standard of 8,000 leukocytes/microliter. The biases in density estimates resulting from these different

Loss function: Poisson Log Likelihood
Assuming that both the data and the simulations are Poisson distributed about the correct value and thereby also allowing for over-dispersion, we denote the Poisson log likelihood for each scenario to be for the objective of age pattern of number of concurrent infections to be where ! is the number of age groups, " the number of surveys, ), !,# the scenario data total patent infections for age group & and survey '. Parameter ? !,# is associated with the model predictions and is given by

Loss function: RSS-biased
We denote a loss function based on biased residual sum of squares: where ! is the number of age groups, " the number of surveys, " ' the initial survey number, and E is the residual given by where F !,# is the observed recorded incidence rate, G $,# * are the predicted total cases (severe and  2.1.6 Age pattern of incidence of clinical malaria: infants 2.1.6.1 Data Objective 6 (age pattern of incidence of clinical malaria in infants) is informed by a dataset on incidence that contains passive case detection data on the age-incidence in infants recorded at the health centre in Idete, Tanzania from June 1993-October 1994 17

Loss function: RSS-biased
The loss function for Objective 6 is the same as Objective 5. For scenario 49 (Idete, Tanzania) the bias is H = 0.357459 and represents the proportion of episodes reported to the village dispensary.

Data
Objective 7 (age pattern of threshold parasite density for clinical attacks) uses the dataset from Dielmo, Senegal (see objective 5) for calibration. The pyrogenic threshold in the (OpenMalaria) predictions is output as the sum of the log threshold values across age groups. The pyrogenic threshold per age group is given as the parasite:leukocyte ratio for recorded incidence of disease. To adjust these densities to the same scale as that used in fitting the simulation model to other datasets, the parasite:leukocyte ratios were multiplied by a factor of 1,416 to give a notional density in parasites/microliter of blood. This number was derived as follows: Parasites were counted against leukocytes and converted to nominal parasites/microliter assuming the usual (though biased) standard of 8,000 leukocytes/microliter. The biases in density estimates resulting from these different

Loss function: RSS-biased (log)
For the objective 7 (Age pattern of threshold parasite density for clinical attacks) we denote a residual sum of squares loss function given by (13) with where > * is the observed pyrogenic threshold, > * * are the predicted sum log pyrogenic threshold, % $,# * the predicted number of hosts for age group & and survey ' and is a bias related to the scenario. Here, this bias is related to the log parasite/leucocyte ratio and thus H = 1/(8000+ ' ) where + ' is the non-Garki density bias.

Data
Data on the relative incidence of severe malaria-related morbidity and mortality in children <9 years old across different transmission intensities were originally collated by Marsh and Snow (1999) 19 (Table 4). Data measurements per age group were available as the relative risk of severe disease compared to age group 1 and the proportion/prevalence of severe episodes. A total of 26 entries on the relationship between severe malaria hospital admission rates and P. falciparum prevalence were used to calibrate objective 8 (Hospitalisation rate in relation to prevalence in children), each represented in a separate simulation scenario, with one observation per scenario. These are summarised in Supplementary Table 5. To obtain a continuous function relating hospital incidence rates to prevalence, linear interpolation between data points was performed. To convert hospital incidence rates to community severe malaria incidence, the hospital admission rates was divided by the assumed proportion of severe episodes representing to hospital (48%). There was assumed to be no effective treatment of uncomplicated malaria episodes or malaria mortality.

Loss function: squared deviation
The loss function is denoted as the log of residual sum of squares where ! -is the access to treatment of severe cases (0.48, estimated in base model), E #,' @ is the scenario predicted rate of severe episodes per 1000 person-years for age group & = 1 (0-9 years), and parameter E #,' * is the interpolated observed rate of severe episodes per 1000 person year given by where ) #,' @ is the predicted prevalence summed over all surveys, ) 2 and ) 3 are the observed prevalences above and below the predicted prevalence) #,' @ , respectively and E 2 and E 3 are the corresponding severe episode rates to the observed prevalences.
The predicted prevalence is given by where ) #,' @ is the total number of parasite positive predicted and % #,' @ are the total number of hosts (division by 24 to give mean values). The predicted rate of episodes per 1000 person year is given by where X #,' @ is the number of severe cases predicted and with division by 2 to convert to from 2 years to 1 year and the division by 24 to give mean number of hosts.

Data
For objective 9 (Age pattern of hospitalisation), a subset of the data collated by Marsh and Snow (1999) 19 (see objective 8) is used. Detailed age-specific severe hospital admission rates were available for 5 of the sites (Supplementary Table 6). The patterns of incidence by age were summarised by age in 1-4 and 5-9 year-old children and compared with 1-11 month old infants by calculating the relative risk. Of the five sites, four were selected for fitting objective 9 based on the predicted prevalence. Baku, The Gambia was excluded as the very low (2%) prevalence here could not be matched. * Period prevalence rather than incidence because precise matching of each community member to hospital admission was not possible. Rates as admission per 1000 children per year (95% CI). + Precise dates of birth were unobtainable for five children. $Defined as child admitted with primary diagnosis of malaria and Blantyre coma score of 2 or less. § Defined in child with primary diagnosis of malaria and haemoglobin of 5.0g/dL or less on admission. Rates for Siaya derived from person-years exposure to risk and admissions for period Nov 1, 1994 to Oct 31, 1995.

Loss function: Residual sums of squares for relative risk
We denote a loss function based on residual sum of squares: where EE # is the relative risk of severe episode for age group & compared to age group 1 and EE # * is the predictive relative risk for age group k compared to age group 1. The predicted relative risk is given by where X # * is the number of severe cases predicted for age group & and % # * the total number of hosts for age group &.
2.1.10 Malaria specific mortality in children (< 5 years old)

Data
For objective 10 (Malaria specific mortality in children <5 years old), mortality data were derived from verbal autopsy studies in sites with prospective demographic surveillance and were adjusted for the effect of malaria transmission intensity on the sensitivity and specificity of the cause of death determination 25 . The data are provided in Supplementary Table 7. The odds ratio for death of a case in the community relative to that in hospital was estimated by fitting to the malaria-specific mortality rates in children less than five years of age assuming the published hospital case fatality rate. Nine sites for which both malaria-specific mortality rates and seasonal transmission patterns were available were included for calibration.
There is one observation per study site and simulation scenario, and predicted values are for one survey at the end of 2 years.
where \]E ' is the observed direct mortality rate for age group 1 (0-5 years) and \]E ' @ is the predicted direct mortality rate for age group 1. The predicted direct mortality rate is given by where \\ ' @ is the number of direct malaria deaths cases predicted for age group 1 and % ' * the total number of predicted hosts for age group 1. The division by 2 is to convert to yearly rate as the survey was conducted at the end of 2 years.

Data
For objective 11 (indirect malaria infant mortality rate), a subset of the data collated by Marsh and Snow (1999) 19 (see objective 8) was used. These constitute a library of sites for which entomologic data were collected at least monthly and all-cause infant mortality rates (IMR) were available. There is one observation per scenario: all cause infant mortality rate (returned as a single number over whole intervention period).

Loss function: Residual sums of squares
The loss function minimises the log sum of squares: where ^\]E ' the observed indirect mortality rate for age group 1 and ^\]E ' @ is the predicted indirect mortality rate for age group 1.   It is assumed that 48% of severe malaria cases seek official care at a heath care facility (hospital). We assume a probability of effective treatment within 14 days of uncomplicated malaria of 36%.
Supplementary Figure 24. Yearly number of malaria-related deaths as a function of PfPR2-10, displayed by parameterization and age group (in years). Malaria mortality incidence is presented in terms of the yearly number of deaths in a population of 1000 individuals. For the OpenMalaria model both deaths directly attributed to malaria (dotted curve) and all deaths associated with malaria (including both deaths directly attributable to malaria and those associated with comorbidities) are shown (full line). The shaded areas show the minimum to maximum range. Figure 25. Yearly incidence of clinical malaria in a seasonal transmission setting as a function of age, displayed by transmission intensity (PfPR2-10) and parameterization. Clinical incidence is presented in terms of the yearly number of events per person. The PfPR2-10 categories include simulated prevalences of 2.5-3.5%, 9-10%, 28-32%, and 47-53% labeled as 3%, 10%, 30%, and 50%, respectively. The shaded areas show the minimum to maximum range.

Supplementary Figure 26. Yearly incidence of clinical malaria in a perennial transmission setting as a function of age (in years), displayed by transmission intensity (PfPR2-10), and parameterization.
Clinical incidence is presented in terms of the yearly number of events per person. The PfPR2-10 categories include simulated prevalences of 2.5-3.5%, 9-10%, 28-32%, and 47-53% labeled as 3%, 10%, 30%, and 50%, respectively. The shaded areas show the minimum to maximum range.

Supplementary Figure 27. Yearly incidence of total severe malaria in a seasonal transmission setting as a function of age (in years), displayed by transmission intensity (PfPR2-10), and parameterization.
Incidence is presented in terms of the yearly number of events per 1000 person-years. It is assumed that 48% of severe malaria cases seek official care at a healthcare facility (hospital). The PfPR2-10 categories include simulated prevalences of 2.5-3.5%, 9-10%, 28-32%, and 47-53% labeled as 3%, 10%, 30%, and 50%, respectively. The shaded areas show the minimum to maximum range.
Supplementary Figure 28. Yearly incidence of total severe malaria in a perennial transmission setting as a function of age (in years), displayed by transmission intensity (PfPR2-10), and parameterization. Incidence is presented in terms of the yearly number of events per 1000 personyears. It is assumed that 48% of severe malaria cases seek official care at a healthcare facility (hospital). The PfPR2-10 categories include simulated prevalences of 2.5-3.5%, 9-10%, 28-32%, and 47-53% labeled as 3%, 10%, 30%, and 50%, respectively. The shaded areas show the minimum to maximum range.