Associations between the spatiotemporal distribution of Kawasaki disease and environmental factors: evidence supporting a multifactorial etiologic model

The etiology of Kawasaki Disease (KD), the most common cause of acquired heart disease in children in developed countries, remains elusive, but could be multifactorial in nature as suggested by the numerous environmental and infectious exposures that have previously been linked to its epidemiology. There is still a lack of a comprehensive model describing these complex associations. We present a Bayesian disease model that provides insight in the spatiotemporal distribution of KD in Canada from 2004 to 2017. The disease model including environmental factors had improved Watanabe-Akaike information criterion (WAIC) compared to the base model which included only spatiotemporal and demographic effects and had excellent performance in recapitulating the spatiotemporal distribution of KD in Canada (98% and 86% spatial and temporal correlations, respectively). The model suggests an association between the distribution of KD and population composition, weather-related factors, aeroallergen exposure, pollution, atmospheric concentration of spores and algae, and the incidence of healthcare encounters for bacterial pneumonia or viral intestinal infections. This model could be the basis of a hypothetical data-driven framework for the spatiotemporal distribution of KD. It also generates novel hypotheses about the etiology of KD, and provides a basis for the future development of a predictive and surveillance model.

Population at risk, ethnicity, age distribution, population density and rural effect. Information on population, ethnicity, medium household income and age distribution was obtained from the Canadian Census 2006 [24][25][26] , 2011 27  FSAs without a Census population count were not considered in the analysis. Population density was calculated based on the land area file that contains the FSAs codes and sizes in square-km, derived from the FSA cartographic boundary file available through the University of Toronto Libraries 30 .
Ethnicity data in the Canadian Census is self-reported, based on the individual's perception of their ethnic ancestry and respondents are allowed to select multiple ethnicity 31 . Given the strong predilection of KD for children of East Asian ethnicity, we considered the percentage of the population who self-reported Chinese, Japanese, Koreans and Taiwanese as one of their ethnicity. Finally, the percentages of the population in each FSA who live in postal codes classified as "rural" was calculated, with rural being defined as FSAs with a zero in the second digit.

Healthcare encounters associated with infectious diseases and allergies. Population-level inci-
dences of health care encounters associated with specific infections or allergy-related illnesses were obtained from both DAD and the voluntary National Ambulatory Care Reporting System Metadata (NACRS), which is also maintained by CIHI. Participation in NACRS is determined at the provincial level and achieves complete capture of ambulatory care in participating provinces. As it was not mandatory for all provinces to submit data to NACRS, models containing healthcare encounters for infections or atopic diseases as covariates were adjusted for the percentage of mandatory submission of infection and atopy data for each respective region. Data were obtained for all healthcare encounters across Canada which were associated with an ICD-10 diagnostic code for selected infections and allergy-related illnesses (see Table 1) in hospital admissions, and outpatient and community-based clinic encounters. These data were used to calculate monthly counts for codes and groups of codes. Infectious diseases and organisms included in this analysis were selected based on their prevalence in Canadian children. Infectious diseases were grouped in 2 separate ways: first, according to the type of etiologic agent (viral, bacterial, fungal infections or infections of unknown origins) and the affected physiological system (upper respiratory tract, lower respiratory tract, gastrointestinal tract and skin) and second, by the specific pathogen involved in the healthcare encounter. This dual system allowed us to consider both the contribution of specific pathogens but also of ailments for which clinical care rarely needs the identification of the specific pathogen and as such the majority of such encounters remain of indeterminate cause (e.g. viral intestinal infections).
Weather and pollution data. Monthly weather data were obtained from the European Centre for Medium-Range Weather Forecasts (ECMWF) reanalysis ERA-Interim product. ERA-Interim uses an integrated forecasting system model and the 4D-variational data assimilation system 32 . We included variables that are known or suspected to directly impact human health, or influence disease vectors and/or the growth and transport of aeroallergens (see full list in Table 1). Air pollution data were obtained from the ECMWF Copernicus Atmospheric Monitoring Service (CAMS) Reanalysis [2018] 33 . The CAMS Reanalysis provides information on aerosol and reactive gases as total column values and are based on assimilation of satellite observations into a forecasting model. We included particulate matter smaller than 2.5 µm (PM2.5), nitrogen dioxide, ozone, sulfur dioxide and carbon monoxide from the CAMS reanalysis. Given their similar association with the incidence of KD, nitrogen dioxide and sulfur dioxide levels were combined into a single variable by summing up the two variables. The reanalysis data were obtained on a 1.25-degree longitude and 1.25-degree latitude grid, and values for grid cells with a center inside an FSA were aggregated to obtain data per FSA.
Atmospheric biological particles. Daily  www.nature.com/scientificreports/ don, Toronto, Ottawa, Moncton and St. Johns). Previous studies have shown that biological atmospheric particle concentration can be regionalized up to ranges of ~ 100 km from the sampling site (~ 30,000 km 2 ) for analysis 34 . As such, for each FSA, atmospheric biological particle data were assigned from the closest city (i.e. the closest particle collection point). These data are only available during the growing season of the year. Pollen counts before and after the growing season were set to zero.
Data preparation for modelling. Due to the different scales and units of the covariates, all covariates were centered to mean zero and scaled to variance one before the analysis, except for wind direction which was not centered to conserve the wind direction.
Aggregation of spatial units. All data except for weather and pollution data was originally available on an FSA level. FSA boundaries were obtained from the 2011 Canadian Census 35 . FSAs designate a geographical area based on the first 3 digits of the postal code (equivalent to the US zip code) that corresponds to an average of ~ 20,000 individuals. In order to maintain sufficient case density for analysis, Northern regions, where population density was extremely low, were excluded. The remaining FSAs were combined using a data driven strategy.
To begin with, the FSA with the smallest population (R1B with 10 inhabitants) was selected. FSAs that share at least one common boundary with this initial FSA were identified (R1A and R3C). The neighbouring FSA with the smallest population (R3C with 1620 population) was then combined to the initial FSA, i.e. the spatial polygons of R1B and R3C were aggregated by dissolving their internal boundaries. We obtain a new map, for which we have one new area (R1B ∩ R3C) with a population of 1630. From this new map, we again selected the FSA with the smallest population (V4G). This FSA had 5 neighbours from which the one with the smallest population was selected for spatial aggregation. This procedure was repeated until FSAs were combined into 100 groups. The supplementary map (online auxiliary material) shows the combined regions as well as lists of all FSAs that were aggregated for each region. Spatial aggregation was performed in order to preserve the granularity of the data and achieve sufficient case density for modelling.
Bayesian disease model. A hierarchical Bayesian model framework is used to address the issues posed by the sparseness of the data and the spatial dependencies between the data 36,37 . The standardized incidence rate for KD in each area and month was modelled with the Besag-York-Mollie (BYM) model 38 , which includes spatial random effects that account for the spatial autocorrelation of the data. Temporal effects have been added to the model sequentially. We added a linear trend, dynamic non-linear trend 39 , specified through a random walk model of order 1, and a seasonal term. More details on the model parameterization can be found in the supplement.
To approximate posterior marginals, we used an integrated nested Laplace approximation (INLA) approach, which is a deterministic algorithm for Bayesian inference proposed by Rue et al. 40 The computation was www.nature.com/scientificreports/ performed with the R-INLA interface 41 . To select the best temporal structure of the model and to select variables, we compared the Watanabe-Akaike information criterion (WAIC) of different models (see Table 1). The calculation of WAIC followed Gelman et al. 42 .
All statistical analyses and maps creation for this project were performed using R version 4.0.2 (https:// www.R-proje ct. org). Base map and data obtained from OpenStreetMap and OpenStreetMap Foundation (www. opens treet map. org).

Results
KD incidence and spatial and temporal variation. From April 2004-March 2017, a total of 5,882 acute hospital admissions for KD were identified in the discharge abstract database maintained by the Canadian Institute of Health Information (CIHI) based on the ICD-10-CA standard code M30.3, of which 266 were excluded according to our previously designed algorithm to identify readmissions and between-hospital transfers (Fig. 1) 22 . The final number of acute admissions included in this study was 5,616, with an annual incidence of 22.3 per 100,000 children under the age of four, 7.2 per 100,000 children from 5 to 9 years of age and 0.8 per 100,000 children from 10 to 18 years of age. Spatiotemporal variations of KD were modelled with a Bayesian hierarchical model (see Expanded methods and Supplementary information for details), which includes covariates related to environmental exposure (list in Table 1). We compared several spatiotemporal models that accounted for the temporal structure differently, but were all based on the Besag-York-Mollie model to account for spatial autocorrelation 38 . The final multivariable model included a linear and non-linear time component, and covariates that improved the performance of the model. Standardized incidence rates obtained from the mean of the posterior distribution of the temporal and spatial random effects are presented in Fig. 2. The model suggests that the risk for KD was highest between late 2010 and 2011, and reached its lowest point at the end of 2007 ( Fig. 2A). As expected, a distinct seasonal distribution with higher risk for KD during the winter months and relatively lower risks at the end of the summer months was found (Fig. 2B). Spatial areas with increased risk for KD were identified in some areas around Toronto and Halifax (spatial standardized incidence rate of larger 1.5 and 1.8 respectively), with a posterior probability > 99% (Fig. 2C). Further hotspots were found in region of Northern Ontario and the city of Calgary (spatial standardized incidence rate > 1.3). A detailed map including posterior probabilities is provided in Appendix 2.

Environmental factors and exposures associated with distribution of KD. For each of the covari-
ates, a base model that included only spatial and temporal effects and adjusted for distribution of age and ethnicity in the population was compared with models that additionally included one of the covariates of interest based on model fit, measured by the Watanabe-Akaike information criteria (WAIC). The WAIC for the base model was 23,134.0, while the adjusted base model WAIC was 23,128.0 (see top of Table 1). Only variables that reduced the WAIC from the base model by more than 1, i.e. improved the model fit, were included in a multivariable model. In a second step, the deletion of each variable from this multivariable model was tested in a backward selection to identify and remove variables that did not contribute to the multivariable model fit (see Table 1 and Fig. 3).
Both adjustment factors, the percentage of population under 4 years of age and the percentage of population self-identifying as Taiwanese, Korean, Japanese or Chinese ('East Asian') were associated with increased risk for KD. In the final multivariable model, environmental factors contributing to an increased risk of KD included lower median income, lower proportion of an area classified as rural, increased atmospheric concentration of spores from Myxomycetes (slime molds), Deuteromycetes (fungi imperfecti) and algae concentrations, lower atmospheric concentration of Dothideomycetes spores and higher combined smog (NO 2 + SO 2 ) concentration in the atmosphere. Increased incidence of KD was also associated with higher incidence of healthcare encounters for atopic dermatitis, viral intestinal illnesses, bacterial pneumonia (associated with Mycoplasma, Klebsiella and possibly Haemophilus influenzae), and multisystem infections with a heterogeneous group of pathogens known to be potentially associated with hyperinflammatory or autoimmune syndromes (Campylobacter, cytomegalovirus, coronavirus, measles, Salmonella). Higher incidence of healthcare encounters for bacterial skin infections or fungal infection secondary to Candida were associated with reduced incidence of KD. The WAIC for the final multivariable model was 23,059.7. Correlation between modelled and observed KD rates. The correlation between the multi-year mean of observed and modelled KD incidence across the regions was 0.98 (Fig. 4A). The correlation across time was 0.86 (Fig. 4B). The time series of the observed and modelled KD incidences averaged across Canada (Fig. 4C) shows that the model reproduced the temporal variation well. Adding environmental factors, aeroallergen exposure and infectious agents to the model improved the observed-to-modelled correlation across years and provinces compared to our model that only adjusted to the population numbers, age and ethnicity distribution (correlation of 0.56 vs. 0.47).

Discussion
In this study, a mathematical model of the spatiotemporal distribution of KD in Canada over a 13-year period was derived. Factors found to be associated with the spatiotemporal incidence of KD included numerous environmental and infectious factors which were integrated into a comprehensive model for the distribution of KD. This study is significant both from the mathematical approach used to combine numerous dimensions of risk measured on different scales into a single model and by the high degree with which it could recapitulate the spatiotemporal distribution of KD. Such a model could eventually form the basis of a prediction and surveillance model for the future incidence of KD in a specific location. Careful examination of associated factors and www.nature.com/scientificreports/ how they contribute to the risk of KD can also substantially contribute to our understanding of the etiology and pathogenesis of KD (Fig. 5). Some well known features of the epidemiology of KD were readily apparent in the associations observed in this study, including an increased risk in children 0-4 years old (although up to 30% of cases occur after 5 years of age), presence of seasonal patterns 14 and spatiotemporal clusters [17][18][19] , as well as a higher risk in people of East Asian ancestry 11 . Numerous additional environmental factors were also identified in this study, some of which had already been reported in previous studies. Without a known etiological agent, epidemiological patterns and associations can be used to extrapolate some aspects of the etiology and pathogenesis of the disease. Positing a framework where KD is an immunological reaction to a still unidentified trigger(s) in genetically and/ or immunologically susceptible individuals, epidemiological patterns and associations can reflect the inherent susceptibility of the population, the local or global distribution of the trigger or can be reflecting the presence of factor(s) modulating the risk of developing KD when encountering said trigger(s). Some environmental factors identified in this study might fit those scenarios and are discussed below.
Beyond the genetic susceptibility to KD, which is evidenced by the strong association between Asian ancestry and various genetic polymorphisms and increased risk of KD 11 , other factors might increase the inherent susceptibility of a population to KD through early childhood experience. It has been proposed that KD partially develops through the hygiene hypothesis, where lack of early microbial experience and limited exposure to allergens may promote atopy and predisposes children to a hyper-reactive response to a normal trigger 43    Identification of KD patients and theoretical framework of the spatiotemporal matrix for use for modelling. The spatial and temporal scales were held constant for incidence of KD and the distribution of all predictive features so that data points at the same position in the matrix are aligned with each other in time and space. The temporal (1 month) and spatial (group of ~ 10 FSAs) calipers were decided based on density distribution of the incidence of KD. The example below (red bar) shows the alignment of the incidence of KD and of a generic predicting feature at a specific time and location in our spatiotemporal matrix. KD Kawasaki disease, n number, FSA Forward Sortation Area. www.nature.com/scientificreports/ study in Japan reported that children living in households with up to three persons had a 1.6 fold increased risk of developing KD compared to those from households with more than five persons 16 . The seasonal pattern of KD noted in this study was concordant with previous findings 22,44 . Seasonal variations in KD have been the subject of multiple previous studies, and is one of the most consistently described epidemiological features 14 . Multiple environmental factors found to be associated with the incidence of KD in this study also have a seasonal distribution. Thus, these may contribute, in a multifactorial manner, to the seasonal distribution of KD.
Results from previous studies have shown that some environmental exposures might be associated with the risk of KD, either positively or negatively. Several studies have reported higher incidences of KD associated with urbanized settings which might reflect the presence of modulating factors 16,21,45 . In a previous case-control epidemiological study we showed that children living in areas with more trees, farms and closer proximity to a natural body of water were at lower risk of KD than children living in areas with reduced exposures 21 . In this study, some of the factors identified for KD are consistent with these previous findings, including the presence of spatial hotspots in cities, decreased risk in rural areas and in environment with denser vegetation and increased atmospheric concentration of some biological particles.
At the same time, activation of the inflammasome via various environmental exposures might be increasing the risk of KD 46 . One such example is pollution exposure. In this study, an association between higher atmospheric concentration of SO 2 and NO 2 and increased risk of KD was noted. A recent Japanese publication reported that a 1-μg/m 3 increase in NO and SO 2 were associated with a higher incidence of KD (3.94, 95% CI 0.04-7.98 and 3.60, 95% CI 1.12-6.14 respectively) 47 . Air pollution has also been shown to negatively affect the respiratory mucosa through an interaction with several defense mechanisms leading to increased mucous production, epithelial barrier dysfunction and attenuated cytokine response [48][49][50] . The association between lower The seasonal effect illustrates a below average standardized incidence rate for KD in September and a high risk in the winter months. (c) Mean spatially autocorrelated random effect, expressed as area-specific standardized incidence rate. Areas with standardized incidence rate larger than 1 have an above average risk compared to the rest of Canada, while standardized incidence rate smaller than 1 indicate a below average risk. The red colours in the Greater Toronto Area and around Halifax for example indicate a 50 to 90% increase of risk. For better visibility of small regions, an interactive map can be found in the online appendix. The spatial units (of the analysis regions consisting of several forward sorting areas) are indicated with boundaries and grey filling. White regions were added to the plot for easier orientation. For uncertainty estimates, see online appendix. www.nature.com/scientificreports/ household income and increasing KD risk found in this study could also be related to activation of the inflammasome. Children from lower socioeconomic areas are known to have generally greater exposure to pollution, dust, toxins and allergens, and are at higher risk of multiple atopic and inflammatory disease, such as asthma and KD because of those exposures 16,51,52 . Beyond susceptibility, case distribution and modulation, environmental associations might point to potential targets for the trigger(s). Higher atmospheric levels of Deuteromycetes (fungi imperfecti) and Myxomycetes (slime molds), both spore-producing, were found to be associated with higher incidences of KD, as were greater atmospheric concentrations of algal spores. Many spore-producing fungi and some algae have allergenic properties, and are known to be capable of initiating an immune response in human in addition to activating the inflammasome 53,54 . In a mouse model mimicking KD, intraperitoneal injection of Candida albicans extract has been shown to induce coronary arteritis 55 . Fungi are also capable of producing MAMPs such as beta-glucans and chitin, which may stimulate an inappropriate immune response through pattern recognition receptors 56 .
There is also a longstanding recognition that a substantial number of patients with KD present either with a documented infection, a recent history of an infectious illness or with infectious disease symptoms 57 . Numerous previous studies have postulated that one or more infectious agents is/are the trigger for KD, but no consistent  Table 1). The values represent the change in standardized incidence rates for a 1 standard deviation change in the predictor. A value below 1.0 (blue) indicates a negative association between the predictor and the standardized incidence rate for KD, and values above 1.0 (red) indicate a positive association. The background color of the labels indicates the type of predictor, with grey representing genetic and demographic factors, blue representing atmospheric pollution, red representing the incidence of healthcare encounters for various causes and green representing atmospheric concentration of biological particles. at.c atmospheric concentration, HE healthcare encounters. www.nature.com/scientificreports/ or definitive findings have been reported 58 . Both respiratory and gastrointestinal diseases have been reported with KD and, indeed, in a previous case-control study, children with KD were found to exhibit respiratory and gastrointestinal symptoms in a higher proportion than controls up to 8-31 days before the appearance of the first KD symptoms 21 .
In this study, we identified groups of healthcare encounters for infections that were associated with increased incidence of KD. One interesting finding is the association between viral intestinal infections and increased incidence of KD. In North America, the most common causes of viral gastroenteritis (i.e. rotavirus, adenovirus and norovirus infections) exhibit a winter peak; notably in our previous environmental case-control study, rotavirus vaccination was associated with a lower risk of KD 21 . However, an autumn or spring peak is seen in other parts of the world, consistent with the global epidemiology of KD 59 . The association between viral intestinal infections and increased incidence of KD possibly reflect the facilitation of the entry of the trigger into the blood stream and as such would represent a modulating factor as opposed to a potential trigger. This being said, in the absence of direct experimentation, it remains unknown whether spores or infections can actually trigger KD or whether they are additional modulators. www.nature.com/scientificreports/ This study must be viewed in light of some limitations. Firstly, this is an observational epidemiological study and, as such, associations are ecologic in nature. Any causal inferences are implied. Secondly, some dimensions of environmental risk dimensions which have previously been linked to KD, such as the microbiome, could not be included in our population-level model. Third, we were limited regarding the granularity of our model, given that KD is a rare disease and cases are dispersed in space and time. Therefore, any smaller, unpredicted changes in factor levels displayed over days or hours would not be captured. In addition, Census data were only available every 5 years, as such we were not able to fully adjust for the temporal variation in ethnicity across Canada across our study period, but fluctuations were expected to be minor. All spatial effects were regionalized, which meant that local environmental factors could not be considered. Moreover, atmospheric biological data were only available for 10 stations situated across Canada, but were extrapolated to the nearest FSA region for each of the 100 FSA groups. This limitation may potentially impact the validity of our pollen findings, as pollen dispersal is likely a local phenomenon. While KD incidence varies strongly with age, we cannot exclude that some factors may be more relevant to certain age groups than others, although no evidence of such an effect has been previously reported. The lower number of patients with KD > 5 years of age precluded a sub-analysis stratified by age. Finally, it is important to note that our proposed theoretical framework is a hypothesis and should not be construed as being confirmed.

Conclusion
Our model of the spatiotemporal distribution of KD across Canada over a 13-year period was able to explain a substantial portion of the variation in the incidence of KD over that time period and identified multiple environmental factors that could be integrated into a theoretical etiological and pathophysiological framework for the spatiotemporal distribution of the incidence of KD. This framework includes a child's susceptibility to the disease, the presence of modulating factors and the spatiotemporal distribution of the trigger(s) and/or of modulating factors. Findings from this study strengthen our understanding of the epidemiology of KD, which is likely affected by various factors reflecting the complex reality of KD.

Data availability
Both the Hospital for Sick Children Ethics Committee and the Canadian Institute for Health Information have placed legal restrictions on sharing the data used in this study. The data from this study contain personal health information and as such, disclosure and distribution, even in an anonymized format, is restricted under the Ontario Personal Health Information Protection Act (PHIPA). Some data will only be available until March 31, 2025 after which it must be destroyed as per the Non-Disclosure/Confidentiality Agreement required by the Canadian Institute for Health Information. Data used in this study can be accessed by qualified researchers who meet the criteria for access to confidential health information. In addition to contacting the PI to access the data, requestors will be required to obtain approval from the Hospital for Sick Children Ethics Committee and the Canadian Institute for Health Information Data Access Program. The authors are legally prohibited from