Factors associated with the difference between the incidence and case-fatality ratio of coronavirus disease 2019 by country

Coronavirus disease (COVID-19) has been spreading all over the world; however, its incidence and case-fatality ratio differ greatly between countries and between continents. We investigated factors associated with international variation in COVID-19 incidence and case-fatality ratio (CFR) across 107 northern hemisphere countries, using publicly available COVID-19 outcome data as of 14 September 2020. We included country-specific geographic, demographic, socio-economic features, global health security index (GHSI), healthcare capacity, and major health behavior indexes in multivariate models to explain this variation. Multiple linear regression highlighted that incidence was associated with ethnic region (p < 0.05), global health security index 4 (GHSI4) (beta coefficient [β] 0.50, 95% Confidence Interval [CI] 0.14–0.87), population density (β 0.35, 95% CI 0.10–0.60), and water safety level (β 0.51, 95% CI 0.19–0.84). The CFR was associated with ethnic region (p < 0.05), GHSI4 (β 0.53, 95% CI 0.14–0.92), proportion of population over 65 (β 0.71, 95% CI 0.19–1.24), international tourism receipt level (β −0.23, 95% CI −0.43 to −0.03), and the number of physicians (β −0.37, 95% CI −0.69 to −0.06). Ethnic region was the most influential factor for both COVID-19 incidence (partial $R^{2}$ = 0.545) and CFR (partial $R^{2}$ = 0.372), even after adjusting for various confounding factors.


Scientific Reports | (2021) 11:18938 | https://doi.org/10.1038/s41598-021-98378-x | www.nature.com/scientificreports/

[...] million population than that of Western Pacific as of 16 August 2020, and these regional differences have not vanished as the pandemic continues, with the Americas being approximately 33.3 times higher than Western Pacific as of 31 July 2021 1. The reason for the difference should be investigated, considering that Western Pacific, where the disease first occurred, had notably lower incidence and mortality than the Americas and Europe, where most of the industrialized countries with sufficient healthcare capacity and hygiene conditions are located.

Numerous studies investigating the risk factors for COVID-19 outcomes within countries have been published. The examined risk factors are demographic features, comorbidities, socioeconomic disparity, and environmental features 3,4 at the district level. Specifically, male sex 5-7, older age 7,8, and comorbidities 6,7,9-11 are suggested as factors that increase the risk of negative COVID-19 outcomes. Socioeconomic disparities, such as income level 12-14, education 12,13, and unemployment 14, are reported to be associated with COVID-19. Moreover, ethnicity is suggested to be associated with the disparity in COVID-19 outcomes, although it has not been verified whether the underlying cause of the disparity lies in biological or socio-economic features of the different ethnicities 8,12,15. However, there are few country-level studies investigating the possible factors for international variation in COVID-19 outcomes. Clarifying the potential country-level factors could provide evidence for policy makers to implement appropriate COVID-19 control measures, such as social distancing and lockdowns.
Using publicly available data, this study aims to identify the factors related to the international variation of COVID-19 outcomes at the country level and to measure how much each factor could explain the disease outcome, after adjusting for the national COVID-19 test rate and for demographic and economic features.

Methods
Data extraction. We obtained data on the COVID-19 outcomes of each country, i.e., total confirmed cases, recovered cases, deaths, and number of tests performed, from the Worldometer coronavirus statistics website 16, one of the most popular COVID-19 data sources, on 14 September 2020. We chose this date so that most countries could be considered to have gone through the first wave of COVID-19 17,18 and so that the chance of biased results caused by possible cocirculation of flu and COVID-19 18 would be reduced. The number of total confirmed cases per million population was used as COVID-19 incidence, and the number of total deaths divided by the number of confirmed cases was used as the CFR (%). Only countries in the northern hemisphere were included, since the northern and southern hemispheres differed in how long COVID-19 had been prevalent and in seasonality as of 14 September 2020 4.
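The two outcome definitions above can be written out as a minimal Python sketch. The function names and the example counts are hypothetical and are not taken from the study data.

```python
def incidence_per_million(confirmed_cases: int, population: int) -> float:
    """Total confirmed cases per one million population."""
    return confirmed_cases * 1_000_000 / population


def case_fatality_ratio(deaths: int, confirmed_cases: int) -> float:
    """Total deaths divided by total confirmed cases, expressed in percent."""
    return deaths * 100 / confirmed_cases


# Hypothetical country: 10 million people, 5,000 confirmed cases, 150 deaths.
example_incidence = incidence_per_million(5_000, 10_000_000)  # 500.0
example_cfr = case_fatality_ratio(150, 5_000)                 # 3.0
```

Note that the CFR defined this way uses confirmed cases as the denominator, so it depends on how many infections a country detects.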
Information on country-level indices, namely geographic, demographic, and socio-economic features, global health security index (GHSI), healthcare capacity, and health behaviors, was examined for possible factors, considering the results of previous studies that investigated associations between COVID-19 health outcomes and each variable 2-15,19-23. Specifically, information on ethnic region 24, proportion of females (%) 25, land area (km²) 25, median age 26, population over 65 years of age (% of total population) 25, total population 27, population density (P/km²) 27, urban population (% of total population) 27, education index 26, GDP per capita (current US$) 25, Gini index 25 for detection of income dispersion, international tourism receipts (% of total exports) 25, and unemployment (% of total labor force) 25 was included in this study. Ethnic region was based on ethnic categories extracted from a previously published article 24, because recognized social standards defining ethnic categories at the national level were absent 28. Rawshani et al. 24 categorized ethnicity by considering geographical adjacency and evaluating each country's ethnic composition, economic development, history, and religion. The ethnic region in our research consisted of nine categories: East Asia; Europe (high income), North America and Oceania; Europe (low income), Russia and Central Asia; Latin America and the Caribbean; Mediterranean Basin; Middle East and North Africa; Nordic countries; South Asia; and Sub-Saharan Africa. The GHSI is a comprehensive assessment of a country's health security capability to prevent and combat epidemics. The index has an overall score and comprises six categories: prevention of pathogen release (GHSI1); detection and reporting for epidemics (GHSI2); rapid response to an epidemic (GHSI3); capability of the health system to treat patients and protect healthcare workers (GHSI4); compliance with international commitments (GHSI5); and nationwide environmental risk and public health vulnerability to biological threats (GHSI6) 29. Each category of the GHSI and the overall score range from 0 to 100, with higher scores indicating better preparedness in the corresponding category.
We collected information on healthcare capacity, such as healthcare access and quality (HAQ) index 30, health expenditure (% of GDP) 25, out-of-pocket expenditure (% of current health expenditure) 25, and the number of hospital beds, nurses, and physicians per 1000 people 25. The HAQ index analyzes 32 causes of death that are considered avoidable when quality medical services are available 30. The causes of death cover various health service areas, such as vaccine-preventable diseases; epidemics and maternal and child health; non-infectious diseases; and gastrointestinal diseases in which death is preventable by surgery 30. The values range from 0 to 100, and higher values indicate that the country has a higher quality of and better accessibility to medical care 30. Information on comorbidities and health behaviors that can contribute to COVID-19 outcomes was also extracted. We included information on obesity prevalence 31, diabetes prevalence 25, smoking prevalence 31, alcohol consumption 25, and the water, sanitation, and hygiene (WASH) index 32. The WASH index assesses the safety and accessibility of water and sanitation facilities and personal hygiene levels. The indicators are independent but also interdependent. The values range from 0 to 100, with higher scores indicating better conditions for the corresponding factor 32. The WHO argued that ensuring proper WASH conditions in communities, homes, schools, and medical facilities would help prevent COVID-19 transmission 33. The WASH index that assessed personal hygiene was excluded from our analysis due to an abundance of missing values (81, 59.6%). There was no duplication between variables. All the data used in this study were publicly available.

Statistical analysis. The analysis was conducted at the country level. Baseline information of variables was assessed with the median, mean, minimum, maximum, and 25th and 75th percentiles. Medians were used for the imputation of missing values of independent variables, as the independent variables were not normally distributed. Multiple linear regression was used to identify potential factors associated with incidence and CFR. The outcome variables, incidence and CFR, were log transformed for the multiple linear regression analysis. Zero values of the CFR (%) were replaced with 0.005 for log transformation (corresponding countries: Laos, Mongolia, and Cambodia). The continuous independent variables were standardized so that the effects of potential factors could be compared properly, as the scale of each factor was different.
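The three preprocessing steps described above (median imputation, log transformation with a replacement for zeros, and standardization) might look roughly as follows. This is a sketch under assumptions: the text does not state the logarithm base or the standard deviation convention, so the natural logarithm and the sample standard deviation are assumed here, and all function names are hypothetical.

```python
import math
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


def log_transform(values, zero_replacement=0.005):
    """Natural log (assumed base); zeros are replaced first, as done for the CFR."""
    return [math.log(zero_replacement if v == 0 else v) for v in values]


def standardize(values):
    """Center on the mean and scale by the sample standard deviation (assumed)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]
```

After standardization, each beta coefficient refers to a change of one standard deviation in the predictor, which is what allows coefficients for variables on different scales to be compared.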
Potential predictors were first identified by univariate linear regression with p < 0.25 (Tables S2 and S3 in Supplement), and backward elimination was then applied. The incidence model additionally included variables representing sex, age, GDP per capita, and COVID-19 test rate, to verify the effects of potential factors on the disease even after adjusting for national demographic and economic features and COVID-19 test rates. The CFR model included COVID-19 incidence instead of COVID-19 test rate, considering that incidence could affect mortality by burdening the national capacity against COVID-19 and the medical system. Multicollinearity was assessed with variance inflation factors (VIF), and variables with VIF ≥ 10 were excluded from the final model. The results are presented as beta coefficients (β), 95% confidence intervals (CI) of the beta coefficients, and partial R-squared statistics. The partial R-squared statistic indicates the portion of the outcome variation explained by each variable in the model. The explanatory power of the model was assessed using the adjusted R-squared statistic.
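To make the two diagnostics above concrete, the following pure-Python sketch computes a VIF and a partial R-squared for a single predictor column. It is not the authors' R code. The partial R-squared formula used here, (R²_full − R²_reduced) / (1 − R²_reduced), is a common definition and is an assumption, since the text does not give one. All function names are hypothetical, and a categorical factor such as ethnic region would require dropping its whole group of dummy columns rather than one column.

```python
def ols_r_squared(X, y):
    """R-squared of a least-squares fit of y on the columns of X plus an intercept."""
    n = len(y)
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    p = len(rows[0])
    # Augmented normal equations [A^T A | A^T y].
    aug = [
        [sum(rows[k][i] * rows[k][j] for k in range(n)) for j in range(p)]
        + [sum(rows[k][i] * y[k] for k in range(n))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    beta = [aug[i][p] / aug[i][i] for i in range(p)]
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def drop_column(X, j):
    """Copy of X without column j."""
    return [[v for k, v in enumerate(row) if k != j] for row in X]


def vif(X, j):
    """VIF of predictor j: 1 / (1 - R^2 of predictor j regressed on the others)."""
    target = [row[j] for row in X]
    return 1.0 / (1.0 - ols_r_squared(drop_column(X, j), target))


def partial_r_squared(X, y, j):
    """Share of the variation left unexplained without predictor j that j explains."""
    r2_full = ols_r_squared(X, y)
    r2_reduced = ols_r_squared(drop_column(X, j), y)
    return (r2_full - r2_reduced) / (1.0 - r2_reduced)
```

Under the rule stated in the text, a predictor for which `vif(X, j)` is 10 or more would be removed before the final model is fitted.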
The sub-analyses on 136 countries, including countries in both the northern and southern hemispheres, selected variables in the same way as the main analysis. Multiple linear regression on log transformed incidence (Table S4 in Supplement) and log transformed CFR (Table S5 in Supplement) was conducted. We also performed sub-analyses for each ethnic region, except the Mediterranean Basin (N = 5), Nordic countries (N = 4), and Sub-Saharan Africa (N = 4) regions, because the number of countries included in these regions was less than 10. Within each ethnic region, COVID-19 incidence and CFR were dichotomized at the median value (0: lower incidence [CFR]; 1: higher incidence [CFR]). A backward elimination process was applied to the model with the potential factors identified by univariate logistic regression with p < 0.25. The model on incidence included variables representing sex, age, GDP per capita, and COVID-19 test rate, while the model on CFR included COVID-19 incidence instead of COVID-19 test rate. Multicollinearity was considered (VIF < 10) when selecting variables for the final model. Multiple logistic regression was conducted, and the results are presented in Supplementary Tables S6 and S7. All statistical analyses were performed using R version 4.0.2 (R Foundation for Statistical Computing, https://www.r-project.org). We used QGIS version 3.10.13 (QGIS Development Team, http://qgis.osgeo.org) for mapping. The institutional review board (IRB) of Korea University granted exemption for this study (IRB exemption number: KUIRB-2020-0281-01).
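The median split used in the region-specific sub-analyses can be sketched as below. How values exactly equal to the median were coded is not stated in the text; coding them as 0 is an assumption of this sketch, and the function name is hypothetical.

```python
import statistics


def dichotomize_by_median(values):
    """Code each value as 1 if it is above the median, otherwise 0 (ties assumed 0)."""
    med = statistics.median(values)
    return [1 if v > med else 0 for v in values]


# Hypothetical incidence values for four countries in one region.
labels = dichotomize_by_median([95.0, 4810.5, 1223.5, 33.0])  # [0, 1, 1, 0]
```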

Results
Characteristics of total selected countries. There were 215 countries or regions reported on the Worldometer site on 14 September 2020. Countries or regions with a population of less than one million (n = 59), those with fewer than 0.001 total tests per population (n = 17), those with more than 10% of independent variables missing (n = 3), and those in the southern hemisphere (n = 29) were excluded. Finally, 107 northern hemisphere countries were included in the analysis (Fig. 1). The countries included in each ethnic region are listed in Table 1. Among the 107 countries, the most frequent ethnic region was "Middle East and North Africa" (21, 19.6%), whereas "Nordic countries" (4, 3.7%) and "South Asia" (4, 3.7%) were the least frequent. Table 2 summarizes the characteristics of the 107 countries, namely the number of COVID-19 tests performed per one million population (COVID-19 test rate); demographic and socio-economic features; Global Health Security capabilities; healthcare capacities; and personal health-related features. COVID-19 test rate was 55,710.0 (25- [...] Table S1 in Supplement. The median value of incidence in "East Asia" was 95.0 (25-75th percentile 33.0-1223.5), whereas that in "Europe (low income), Russia and Central Asia" was 4810.5 (25-75th percentile 2828.0-7356.5). The CFR in "East Asia" was 1.3 (25-75th percentile 0.0-1.8), whereas that in "Europe (high income), North America and Oceania" was 3.8 (25-75th percentile 2.4-6.5).
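The country selection described at the start of this section amounts to a filter with four conditions. The sketch below is illustrative only: the `Country` record and its field names are hypothetical, and the thresholds are the ones stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Country:
    name: str
    population: int
    total_tests: int
    missing_fraction: float  # share of independent variables that are missing
    northern_hemisphere: bool


def is_eligible(c: Country) -> bool:
    """Apply the four exclusion criteria stated in the text."""
    return (
        c.population >= 1_000_000                     # at least one million people
        and c.total_tests / c.population >= 0.001     # enough tests per population
        and c.missing_fraction <= 0.10                # at most 10% missing variables
        and c.northern_hemisphere                     # northern hemisphere only
    )


# Consistency check of the reported counts: 215 - 59 - 17 - 3 - 29 = 107.
remaining = 215 - 59 - 17 - 3 - 29
```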
Factors related to COVID-19 incidence. The results of the multiple linear regression analysis to investigate the significant factors affecting COVID-19 incidence are presented in Table 4. The explanatory power of the model was 63.7% (adjusted $R^{2}$ = 0.637). Ethnic region (p < 0.05), GHSI4 (β 0.50, 95% CI 0.14-0.87), population [...] the CFRs of all the other ethnic regions were significantly higher than that of "East Asia," even after adjusting for sex, age, economic status, and COVID-19 incidence. The beta coefficient of the "Latin America and the Caribbean" region was the highest among the ethnic regions (β 3.77, 95% CI 2.62-4.92). Ethnic region had the highest partial R-squared statistic among the factors (partial $R^{2}$ = 0.372). Countries with higher GHSI4, higher proportions of population over 65 years of age, lower international tourism receipts, and fewer physicians were likely to have higher CFRs, and ethnic region explained the largest part of the model.

Discussion
An analysis was conducted with publicly available data to identify the factors associated with COVID-19 incidence and CFR. Possible factors, namely COVID-19 test rate; geographical, demographical, and socio-economic variables; degree of preparedness for epidemics; healthcare capacity; and personal health-related variables, were evaluated. Ethnic region was the most influential factor for both COVID-19 incidence and CFR, even after adjusting for the national demographic and economic features and COVID-19 test rates/incidence. The results of the sub-analysis including countries in both hemispheres also showed that ethnic region accounted for the largest part in the incidence (partial $R^{2}$ = 0.511) and CFR models (partial $R^{2}$ = 0.322) (Tables S4 and S5 in Supplement). Furthermore, sub-analyses by each ethnic region did not consistently reveal any significant factors related to incidence and CFR (Tables S6 and S7 in Supplement). Our results may support the hypothesis, suggested by Yamamoto and Bauer 2, that East Asia could have evolved over a long time to be more resistant to SARS-CoV-2. Yamamoto and Bauer 2 proposed that differences in (1) socio-behavioral aspects, (2) virulence of viruses, (3) evolutionary history related to selection of people by the virus, or (4) hygienic conditions could cause discrepancies in COVID-19 outcomes between Central Europe and East Asia. In our results, ethnic region was the most influential feature explaining the international variation of the disease, even after considering socio-behavioral and hygienic aspects, with the WASH index, as possible factors. As COVID-19 control policies were implemented to constrain socio-behavioral aspects, national differences in policies could partly explain the differences in incidence 2,19. However, the national differences in policies could not fully explain the differences in the CFRs across countries 2. Chaudhry et al. 19 also suggested that government actions, such as rapid border closing and complete lockdowns, could not sufficiently explain COVID-19 mortality. Furthermore, since there are insufficient virological studies investigating SARS-CoV-2 worldwide 2, the hypothesis that highlights differences in pathogenicity of viruses across regions can hardly be supported. Therefore, our findings could support the 'evolutionary hypothesis' among the four hypotheses suggested by Yamamoto and Bauer 2 to explain these regional variations. That is, the difference in native susceptibility of the hosts in each region may be a possible factor to explain these regional variations in incidence and fatality of COVID-19. Asians living in the 'Asian ethnic region', including Chinese people, may have lower susceptibility to SARS-CoV-2, for any reason including the possibility of past exposure to a pathogen with similar antigenicity. However, our data and analysis in this study may be insufficient to rule out other possible hypotheses and explanations. We do not dispute the results of previous studies 34 suggesting that effective control measures against COVID-19 in East Asia could have resulted in lower incidence and CFR. As our study is a country-level ecological study, we aim to suggest a hypothesis, not to prove one. Therefore, further studies at the individual level are required to derive direct evidence for different susceptibilities to COVID-19 across ethnic regions, considering collinearity between ethnic region and control measures.

Table 3. COVID-19 health-related outcomes of total selected countries. COVID-19 incidence: total confirmed cases of COVID-19 per one million population; COVID-19 mortality: deaths due to COVID-19 per one million population.

GHSI4, which evaluated the health system, was associated with a higher COVID-19 incidence and CFR. Our results support the argument that GHSI is not sufficiently predictive of pandemic response 35,36, and additional factors that better estimate pandemic preparedness should be embedded in the index 36. However, we should be cautious while interpreting the predictiveness of GHSI for the vulnerability to the epidemic, as the COVID-19 pandemic is still ongoing.
Countries with better water safety levels were likely to have higher incidence. These results support the hypothesis that poorer hygienic conditions are associated with higher resistance to infectious disease 2. However, the observed negative effects of the WASH index should be interpreted with caution. The association between water security and incidence might have arisen because countries with high water security usually had high economic status, given that GDP per capita and the WASH index for water safety were positively correlated (r = 0.47, p < 0.001). Therefore, the authors are not convinced of the negative effect of water safety and maintain that water security should be ensured for tackling the pandemic 37.
Countries with higher population densities were likely to have higher incidence. In common perception, dense areas are vulnerable to closer contact, which leads to higher caseloads in directly transmitted infectious diseases. Our study supports this common perception, which is also supported by Bhadra et al. 20 and Coşkun et al. 21. However, a study that analyzed 913 U.S. metropolitan counties 22 disputed this perception by showing that connectivity between counties, rather than population density, was significantly associated with incidence. As studies are usually performed within countries 20-22, further studies at the country level are needed to clarify whether population density is associated with the disease outcomes.

As examined by several other studies 19,38, older age was associated with a higher CFR. Older patients with COVID-19 are more vulnerable to progression to severe disease 39, and a greater number of patients with severe disease could burden the national economy and healthcare capacity. Therefore, governments should pay close attention to older patients with COVID-19.

Figure 2. Medians and beta coefficients of COVID-19 incidence by ethnic region in 107 northern hemisphere countries. The size of the red circle indicates the beta coefficients, which were determined by using multiple linear regression analysis on log transformed COVID-19 incidence, having the "East Asia" region as a reference (*p < 0.05, **p < 0.01, ***p < 0.001).

Figure 3. Medians and beta coefficients of COVID-19 case-fatality ratio (%) by ethnic region in 107 northern hemisphere countries. The size of the red circle indicates the beta coefficients, which were determined by using multiple linear regression analysis on log transformed COVID-19 case-fatality ratio, having the "East Asia" region as a reference (*p < 0.05, **p < 0.01, ***p < 0.001).
Countries with fewer healthcare professionals, especially physicians, were vulnerable to higher CFRs. It is possible that an increase in CFR resulting from the lack of healthcare professionals could lead to the collapse of the healthcare system. Retaining a sufficient number of healthcare workers is essential to win this war 40. Therefore, governments should secure the safety and well-being of healthcare professionals in both physical and psychological aspects 40,41.
Countries with higher usual tourism receipts were likely to have lower CFRs. In contrast, Farzanegan et al. 23 suggested that countries with higher inbound and outbound tourism are more likely to have higher numbers of confirmed cases and deaths. Most European countries enforced border control measures at a later stage than Asia-Pacific countries 42. Since the extraction date of the COVID-19 outbreak data we used is about five months later than that of the previous study 23, it is possible that the effect of border control is more fully reflected in our study. However, because the effect of border control could not be fully considered, further studies that take into account the characteristics of the border controls implemented by countries are required.
Our study has several limitations. As the COVID-19 pandemic is still ongoing, the data we used are limited with respect to reflecting the current situation. Because the information related to COVID-19 was extracted only once, i.e., on 14 September 2020, information after this date is not reflected in our analysis. However, by setting 14 September 2020 as the data capture date, we could consider that most of the countries had gone through the first wave of COVID-19 17,18, and we could reduce the chance of biased results due to possible cocirculation of flu and COVID-19 18 and due to the possible effect of vaccination. We did not include national control measures as potential factors, as mitigation policies themselves are difficult to compare in terms of effectiveness. Specifically, each country had various kinds of policies at different intensities 43,44, different initiation times 43,44, and various degrees of public compliance with the policy 45-47. Age-standardization, which is useful to fairly compare the disease outcomes across countries 48, could not be implemented in our study. This was because each country reported the outcomes with different age standards, and some countries did not report by age group. However, including age-representing variables in the analysis models is likely to have adjusted for the differences in age structure among countries to some degree. Finally, we can hardly make a definitive judgement on the effect of ethnicity across countries, as the categories of ethnic region we used were not based on social consensus but were taken from a single published article 24. However, because social standards for ethnic categories are absent, the ethnic grouping we used was the best available option. Genetic factors could not be investigated in our study because data regarding genetic factors related to COVID-19 were unavailable. This study is meaningful in examining the association of ethnicity with COVID-19 health-related outcomes at the country level and in highlighting that ethnicity could largely explain COVID-19 incidence and CFR. Moreover, the authors consider that this work could serve as a trigger for further research investigating the effect of different genetic predispositions across ethnicities on COVID-19 outcomes.

Table 4. Multiple linear regression analysis on log transformed COVID-19 incidence. COVID-19 incidence: total confirmed cases of COVID-19 per one million population; β: beta coefficients; SE: standard error; 95% CI: 95% confidence interval; COVID-19 test rate: number of COVID-19 tests performed per one million population; GDP: Gross Domestic Product; GHSI: Global Health Security Index; WASH: Water: index that assesses the safety and accessibility to water.