Introduction

On the last day of 2019, a viral outbreak of unknown origin was detected in a seafood market in Wuhan City, Hubei Province, China1. This virus, later named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2; Coronaviridae) and responsible for the clinical disease known as COVID-19, has now spread around the globe and was declared a pandemic by the World Health Organization (WHO) on March 11, 2020. To date (March 1, 2021), it has been detected in 219 countries, with > 115 million people infected and > 2.5 million deaths worldwide2. The global scale and severity of the impacts of this disease, despite it being at an intermediate stage of development, are unprecedented, and studies have suggested that it might take more than a decade for the world to recover from its effects, both socially and economically3.

Since the emergence of COVID-19, several studies have examined the transmission dynamics of this disease4. While some have documented the route of transmission through human-to-human contact, others have examined the role that environmental factors may play in facilitating the spread of the disease by analyzing the temporal and spatial relationships of these factors with the COVID-19 transmission rate. Most of these studies have reported a negative relationship between transmission rate and several proxies of temperature and humidity, suggesting that the disease spread is enhanced in colder and drier climates5,6,7,8,9. Other environmental variables have received less attention, and results have been inconclusive or have differed among countries. For example, one study reported an inverse relationship between COVID-19 transmission and wind speed in Iran10, while global studies have found no significant association between the two variables11,12. A negative relationship between disease transmission and solar radiation has been reported in Iran10, and a positive relationship with the concentration of atmospheric pollutants was found in China13. Overall, and despite the rapid response of the scientific community to understand the transmission of COVID-19, the role that environmental variables play in the disease dynamics remains an open question that requires further evidence from across the world.

The goal of this study was to examine the potential role of multiple environmental variables in COVID-19 transmission rates and patterns in Chile. Environmental variation within Chile is unique due to the particular geography of this country, which includes an altitudinal range of ≈ 7000 m from sea level to the top of Aconcagua mountain, and a ≈ 40º latitudinal gradient that covers 6 climatic zones: desert, semiarid, mediterranean, marine west coast, tundra and ice sheet. At the same time, the population across the country shares a common social behaviour, and regulations are established by a single national authority, allowing the evaluation of environmental variables under relatively consistent socio-economic conditions14. We thus aim to provide information about COVID-19 transmission across a wide range of environmental variation within a single country that may help to understand the dynamics of this disease.

Results

Summary of study area characteristics and COVID-19 transmission data

Mean temperature varied between − 11.99 °C (polar macroclimate) and 23.69 °C (semi-arid); relative humidity, between 2.56% (desert) and 96.83% (polar); atmospheric pressure, between 600.29 mbar (semi-arid) and 1062.71 mbar (marine south west); and wind speed, between 0 km h−1 (mediterranean) and 64.01 km h−1 (polar) (Table 1). Absolute population size ranged from 137 inhabitants (Antarctica) to 645,909 inhabitants (Puente Alto, mediterranean) (Fig. 1; see Supplementary Table S1 online). The first registered COVID-19 infection case in Chile occurred on February 23, 2020; by the data collection date (August 16, 2020), more than 387,000 infected inhabitants and 10,500 deaths had been reported. The mean infection rate ranged from 0 (several cities) to 4444 in General Lagos (desert) in week 32, and the number of weekly infections from 0 (several cities) to 2549 (Puente Alto) (see Supplementary Table S1 online).

Table 1 Mean values (standard error) of the variables recorded between February 23 and August 16, 2020 in relation to climates.
Figure 1

Correlation plot of the environmental and COVID-19 database. Green paths indicate positive correlations, brown paths negative correlations. TM = Average atmospheric temperature; TMax = Maximum atmospheric temperature; TMin = Minimum atmospheric temperature; RH = Relative humidity; Rain = Accumulated precipitation; AP = Atmospheric pressure; SR = Ultraviolet solar radiation; WS = Wind speed; Alt = Elevation; Den = Population density; IRC = Infection rate.

Relationship between COVID-19 and predictive variables matrices

The strongest statistical associations between the variables were registered with no time lag (0 days). The correlation analysis showed a strong relationship between the average, maximum and minimum temperatures, as well as between altitude and relative humidity, atmospheric pressure and solar radiation. There was no correlation between population density, wind speed and IR, and low correlation between precipitation and all other variables (Fig. 1). According to the multicollinearity analysis, the variables of greatest importance were minimum and mean temperature (variance inflation factor, VIF > 10) and the least important was maximum temperature (VIF < 10). Given the above, the reduced data matrix excluded mean and maximum temperature and altitude as predictor variables.

The training model was adjusted using the database with 3368 observations and 7 predictive variables. Hyperparameter selection improved the predictions. The final tuning values used for the model were n_estimators (number of iterations in training) = 56, max_depth (maximum depth of the tree) = 4, eta (model learning rate) = 0.03, gamma (minimum loss reduction required to perform an additional partition on a leaf node of the tree) = 0, colsample_bytree (fraction of predictor variables sampled when building each tree) = 0.5, min_child_weight (minimum sum of sample weights required in a leaf node, which helps to prevent overfitting) = 1 and subsample (sampling rate of all training samples) = 1 (Table 2). The predictions corresponding to week 0 showed the lowest error with 56 iterations (Table 3). The scatter plot of predicted mean relative infection rate (IR) versus observed values using the final model is illustrated in Fig. 2. The scatter plot considering all parameters demonstrates an acceptable prediction of IR, with an R2 = 0.32 (R = 0.57). The Gain Score showed that the most important variables were minimum temperature (Tmin), atmospheric pressure (AP) and relative humidity (RH) (Fig. 3). All these final selected variables showed a highly significant negative relationship with the infection rate (p < 0.0001; Tmin r = − 0.25, AP r = − 0.23, RH r = − 0.21).

Table 2 Extreme Gradient Boosting regression modeling hyperparameters from the grid search.
Table 3 Root mean square error (RMSE) obtained for each database from the adjusted predictive model.
Figure 2

Scatter plot of predicted IRC versus observed values using the XGBoost method.

Figure 3

Top 7 most important variables based on the Gain Score (impurity) metric. TMin = Minimum atmospheric temperature; AP = Atmospheric pressure; RH = Relative humidity; SR = Ultraviolet solar radiation; Den = Population density; WS = Wind speed; Rain = Accumulated precipitation.

Discussion

Our results demonstrate that the COVID-19 infection rate in Chile to date has been linked to 3 main environmental variables: minimum temperature, atmospheric pressure and relative humidity. Firstly, we found a negative relationship between infection rate and minimum temperature. Other studies have reported a similar, negative relationship between air temperature and the transmission of COVID-1915,16,17,18 and other respiratory diseases such as SARS19. However, a positive correlation with average and minimum temperature has been reported in Singapore, especially in the initial phase of transmission20. Others have found an indirect, positive effect of average temperature on the spread of the SARS-CoV-2 virus due to people’s enhanced mobility at higher temperatures21. These findings are particularly concerning at present for the southern hemisphere, which is entering winter: the lower temperatures expected in the coming months could drive an upsurge of the disease.

Atmospheric pressure was the second relevant variable and it was negatively related to the spread of the SARS-CoV-2 virus. The link between atmospheric pressure and the spread of the SARS-CoV-2 virus has been studied in several countries20,21,22,23,24,25,26, since atmospheric pressure is responsible for air movement (wind), cloud formation, precipitation and humidity. Therefore, this variable has a strong influence on climatic variation, generating favourable conditions for the virus spread in some cases (drought and light wind) but not in others (high humidity and strong wind). Others have provided evidence for a direct link between atmospheric pressure and the virus spread, indicating that the unusual persistence of an anticyclonic atmospheric situation (i.e., abnormally strong positive phase of the North Atlantic and Arctic oscillation) in southwestern Europe, centered in Spain and Italy during February 2020, generated conditions of drought and light wind that could have favored the faster spread of the virus compared to other European countries22. This is reinforced by the positive correlation found between atmospheric pressure and the frequency of COVID-19 cases in Mozambique25, and with several spread parameters (infection rate, effective reproduction number and compound growth rate) in 487 cities in the United States23. Such a positive relationship could be related to an increase in fog associated with high pressure, which increases the humidity of the air and surfaces. However, other studies have found an inverse link between atmospheric pressure and the spread of the SARS-CoV-2 virus in Singapore and China20,26, which could be explained by the fact that high pressures can limit the suspension time of viral particles in the environment26. Indirectly, atmospheric pressure could also reduce the virus spread by limiting people's mobility21. Overall, there is no consensus on the link between atmospheric pressure and the spread of the SARS-CoV-2 virus, since there is evidence for both direct and inverse correlations, even within the same country (e.g., Italy)24.

The negative relationship that we observed between relative infection rate and relative humidity is consistent with earlier evidence that high relative humidity reduces the viability of SARS-CoV-227,28 and the transmission rates of COVID-197,29. Similarly, high relative humidity has been reported to reduce the survival of the influenza virus30 and the incidence of this disease8. Environmental humidity can affect viral transmission through its interaction with respiratory droplets, which act as virus containers and can remain longer in dry air31,32. Additionally, high relative humidity leads to inactivation of the viral lipid membrane, and consequently to a decrease in virus stability and transmission33,34. However, one study found a direct link between average relative humidity and the SARS-CoV-2 basic reproductive ratio in China26. Relative humidity can also contribute indirectly to the spread of the SARS-CoV-2 virus through its influence on people’s mobility21.

In conclusion, our study shows that climate plays a key role in the transmission of COVID-19 in Chile, a country that comprises a particularly high variation of environmental conditions. Importantly, the climatic conditions expected for the coming months in the southern hemisphere (i.e., lower temperature, humidity and atmospheric pressure) are highly likely to favour a higher disease transmission speed. Our study and others providing information about how climatic factors can influence the spread of the disease may serve as the basis for predictive models of COVID-19 transmission through space and time, which will be highly relevant to decision-making and management of the disease.

Materials and methods

Study area

We examined data from 360 ‘communes’ or cities in Chile, which are distributed across 4200 km from north to south. Latitudes of our study area range from 17° S (Arica) to 56° S (Cabo de Hornos), and altitudes range from 8 m a.s.l. (Pacific Ocean coast) to 3962 m a.s.l. (San Pedro de Atacama, Andean mountain range). The study area covers the following five climatic zones: (i) desert (17°30′–26°00′ S), (ii) semiarid (26°00′–32°00′ S), (iii) mediterranean (32°00′–39°00′ S), (iv) marine west coast (39°00′–44°00′ S) and (v) tundra (44°00′–56°00′ S); the only climatic zone excluded from the study was the ice sheet (located in the highest areas of the Andes mountain range) because of the absence of human population. In terms of macroclimates, ca. 41% of the country is temperate, 31% arid and the remaining 28% has a polar climate36 (Fig. 4).

Figure 4

Maps of the distribution across Chile of (A) infection rate, (B) minimum temperature (°C), (C) atmospheric pressure (kPa), and (D) relative humidity (%).

The Chilean population is 19.11 million inhabitants, of which 51% are women and 49% men. Life expectancies are 83 (women) and 78 (men) years; 68.7% of the population is between 15 and 64 years old and 11.9% is over 65 years old. Eighty-eight percent of inhabitants live in urban areas and the estimated international migration rate is 12 per thousand inhabitants. Thirteen percent of the population belongs to indigenous or native groups, of which 80% are Mapuche, 7% Aymara and 4% Diaguita37. The population is aging as a result of the decline in fertility and the increase in life expectancy38.

Chile has 16 administrative regions35,39, of which the Metropolitana region concentrates the largest population (7.1 million inhabitants), followed by the Valparaíso region (1.8 million inhabitants). In contrast, the Aysén and Magallanes regions, located in the southern extreme of Chile, have the smallest populations (< 200,000 inhabitants). Inhabitants > 65 years old mainly live in the areas with mediterranean climate in the cities of Santiago, Valparaíso and Concepción, and correspond to 6.28% of the total employed inhabitants in the country40. The total population is projected to reach 21.6 million by 2050 (i.e., an increase of 15.3% compared to 2020) under assumptions of birth and immigration surpassing mortality and emigration, with inhabitants > 65 years old predicted to exceed 3 million (25% of the population)38.

COVID-19 transmission data and predictive variables

We characterized COVID-19 transmission in Chile from February 23 to August 16, 2020, based on the mean relative infection rate [IR; (number of infected inhabitants per week/total population) × 100,000] of 360 cities. Data were obtained from official sources of the Government of Chile41. We extracted daily climatic data from the databases of 159 meteorological stations in Chile42, corresponding to cities with and without presence of COVID-19, for the same period; these data were averaged per week to make them comparable with the variables quantifying disease transmission. The climatic variables, recorded every 1 h at the meteorological stations, were averaged weekly and expressed as follows: weekly average, maximum and minimum atmospheric temperature (°C); weekly average relative humidity (%; moisture content (i.e., water vapor) of the atmosphere, expressed as a percentage of the amount of moisture that can be retained by the atmosphere (moisture-holding capacity) at a given temperature and pressure without condensation)43; absolute humidity (g m−3); accumulated precipitation (mm); atmospheric pressure (mbar); ultraviolet solar radiation (MJ m−2); and wind speed (km h−1). Additionally, we obtained data for other relevant environmental, demographic and geographic variables, which were averaged and expressed as follows: air pollutant data, including particulate matter with aerodynamic diameter ≤ 10 μm (PM10) and ≤ 2.5 μm (PM2.5), obtained from a database of 30 air quality stations44; and city area (km2), population size (ind), population density (ind km−2), latitude (absolute degrees), longitude (absolute degrees) and altitude (m a.s.l.), obtained from CONAF45 and IDE Chile46.
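The IR definition and the weekly averaging step described above can be illustrated with a minimal Python sketch. It is not the authors' code (their analyses were run in R); the function names `infection_rate` and `weekly_means` are hypothetical, and the 168-hour week assumes complete hourly records. The worked value pairs the largest weekly case count and the population of Puente Alto reported in the Results purely as an arithmetic illustration.

```python
from statistics import mean


def infection_rate(weekly_cases, population, per=100_000):
    """Relative infection rate: (weekly infections / total population) x 100,000."""
    if population <= 0:
        raise ValueError("population must be positive")
    return weekly_cases / population * per


def weekly_means(hourly_values, hours_per_week=168):
    """Average consecutive blocks of hourly readings into weekly means.

    A trailing incomplete week is averaged over the readings that are present.
    """
    return [
        mean(hourly_values[i:i + hours_per_week])
        for i in range(0, len(hourly_values), hours_per_week)
    ]


# 2549 weekly infections and 645,909 inhabitants (Puente Alto, from the Results)
ir_example = infection_rate(2549, 645_909)  # a little under 395 per 100,000

# Two synthetic weeks of hourly temperature readings (illustrative values only)
temps = [10.0] * 168 + [14.0] * 168
weekly_temps = weekly_means(temps)
```

Expressing infections per 100,000 inhabitants makes small and large communes comparable, which is why a commune with few inhabitants can show a very high IR from a handful of cases.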

Statistical analyses

To analyze time lags in the transmission of the virus, three databases were built using environmental information with different time lags relative to the response variable (IR): (a) 0 days, (b) 7 days and (c) 14 days. Each database was subjected to an exploratory analysis, which allowed the identification of missing, influential or out-of-range data, and the elimination of variables that were non-influential and/or highly correlated with others. This reduced dimensionality by eliminating redundant information. For this, a correlation matrix was constructed and a boosting model was fitted using the VIF criterion47. The final variable selection was modeled through extreme gradient boosting (XGBoost)48. XGBoost is a widely used machine learning library based on a gradient boosting algorithm proposed by Chen48, which processes the data sequentially with a loss or cost function, minimizing the error iteration after iteration and increasing predictive power compared to other sequential tree models. We used VIFs to identify multicollinearity46 and data were normalized, which is important for machine-learning estimators. Dataset records were shuffled and split into 80% for training and 20% for testing.
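As an illustration of the VIF screening mentioned above, the following Python sketch computes the VIF of each predictor as the corresponding diagonal element of the inverse of the predictors' correlation matrix, which is equivalent to 1/(1 − R²) from regressing each predictor on the others. It is a minimal sketch under stated assumptions, not the authors' R workflow: the helper names are hypothetical and the six-observation data set is invented solely to show a collinear pair (minimum and mean temperature) next to a largely independent variable (relative humidity).

```python
from statistics import mean, pstdev


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    n = len(columns[0])
    z = []
    for col in columns:
        m, s = mean(col), pstdev(col)
        z.append([(v - m) / s for v in col])  # standardize each column
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if abs(aug[pivot][c]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        aug[c], aug[pivot] = aug[pivot], aug[c]
        p = aug[c][c]
        aug[c] = [v / p for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def vif(columns):
    """Variance inflation factor of each column: diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


# Invented toy data: tmean tracks tmin almost exactly, rh does not.
tmin = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
tmean = [7.1, 8.9, 11.2, 12.8, 15.1, 16.9]
rh = [72.0, 78.0, 58.0, 62.0, 74.0, 76.0]

vifs = vif([tmin, tmean, rh])
flagged = [name for name, v in zip(["tmin", "tmean", "rh"], vifs) if v > 10]
```

With the VIF > 10 threshold used in the Results, both temperature variables are flagged in this toy example; in practice one member of a collinear pair is dropped and the VIFs are recomputed.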

The model was first trained with the original parameters, adjusting the depth (max_depth) between 1 and 10 (ref. 49). Once the best iteration was identified, we proceeded to predict on the validation set. The model was then tuned using hyperparameters (Table 2). We carried out a parallelized grid search on the hyperparameters with threefold cross-validation to find the best model based on root mean square error (RMSE), R2 and mean absolute error (MAE). The xgboost and caret libraries of the R software50 were used for the analyses. Once the most important variables were selected, a spatial representation of each of them and of the climates was generated for each city, adapted from Sarricolea et al.36. For this, the existing spatial coverage in shape format in the national database of the Spatial Data Infrastructure IDE-Chile46 was used. For the management and analysis of spatial data, the ArcMap software version 10.8.1 (ESRI Inc., Redlands, California, USA) was used.
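The tuning procedure described above (a grid search scored by threefold cross-validation, with RMSE, MAE and R2 as metrics) can be sketched in Python as follows. This is an illustration only: the authors used the xgboost and caret libraries in R, whereas here a deliberately simple stand-in model (a least-squares line with a hypothetical `shrink` hyperparameter) replaces XGBoost so that the example runs with the standard library alone. The search runs serially rather than in parallel, and the data are synthetic.

```python
import itertools
import math
import random
from statistics import mean


def rmse(y, p):
    """Root mean square error."""
    return math.sqrt(mean((a - b) ** 2 for a, b in zip(y, p)))


def mae(y, p):
    """Mean absolute error."""
    return mean(abs(a - b) for a, b in zip(y, p))


def r2(y, p):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ybar = mean(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1 - ss_res / ss_tot


def kfold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_shrunken_line(x, y, shrink):
    """Stand-in model (assumption, not XGBoost): least-squares slope scaled by (1 - shrink)."""
    xb, yb = mean(x), mean(y)
    sxx = sum((v - xb) ** 2 for v in x)
    slope = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / sxx * (1 - shrink)
    return lambda v: yb + slope * (v - xb)


def grid_search_cv(x, y, grid, k=3):
    """Return (mean CV RMSE, params) for the best hyperparameter combination."""
    folds = kfold_indices(len(x), k)
    results = []
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        scores = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(x)) if i not in held]
            model = fit_shrunken_line([x[i] for i in train],
                                      [y[i] for i in train], **params)
            preds = [model(x[i]) for i in held_out]
            scores.append(rmse([y[i] for i in held_out], preds))
        results.append((mean(scores), params))
    return min(results, key=lambda r: r[0])


# Synthetic data with an exact negative linear relationship
x = [float(v) for v in range(12)]
y = [100.0 - 2.0 * v for v in x]

best_rmse, best_params = grid_search_cv(x, y, {"shrink": [0.0, 0.5, 0.9]}, k=3)
```

Every hyperparameter combination is scored on the same three folds, so differences in mean RMSE reflect the hyperparameters rather than the partitioning; the same loop could report MAE and R2 by calling `mae` and `r2` on the held-out predictions.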