Introduction

Since December 2019, an increasing number of pneumonia cases caused by a novel coronavirus (SARS-CoV-2) have been identified in Wuhan, China1,2. This new pathogen has exhibited high human-to-human transmissibility, with 16,819,944 confirmed cases of COVID-19 and approximately 662,000 deaths reported globally as of July 29, 2020.

On January 23, 2020, Wuhan—a city in China with 11 million residents—was forced to shut down both outbound and inbound traffic in an effort to contain the COVID-19 outbreak ahead of the Lunar New Year. However, it is estimated that more than five million people had already left the city before the lockdown3, which has led to the rapid spread of COVID-19 within and beyond Wuhan.

In addition to population mobility and human-to-human contact, environmental factors such as absolute humidity (defined as the water content in ambient air) and temperature have been found to be strong environmental determinants of transmission for some viral pathogens4,5. For example, influenza viruses survive longer on surfaces or in droplets in cold and dry air, thus increasing the likelihood of subsequent transmission. For COVID-19, a recent study found that higher temperatures may have led to higher transmission in 122 cities in China, concluding that there was no evidence supporting the hypothesis that case counts of COVID-19 would decline when temperatures increase6. In contrast, another study showed that higher transmission was observed in colder places when analyzing data from 429 cities across the world, suggesting that temperature could potentially impact COVID-19 transmission7. A third study found that warm and dry weather was favorable to the survival of the virus8, whereas a fourth determined that transmission would decrease with the arrival of spring and summer9.

As discussed in a recent paper10, quantifying the relationship between COVID-19 transmission and weather variables is a challenging task for multiple reasons. First, characterizing the time evolution of COVID-19 transmission from the available datasets produced by multiple public health agencies can yield very different temporal outbreak trajectories. Second, estimating the instantaneous transmission rate, Rt, using dates of report rather than dates of symptom onset will invariably lead to significantly different results. Third, the choice of method used to calculate Rt (for example, Cori’s method or Wallinga and Teunis’ method) will lead to temporal shifts that complicate establishing causal relationships between weather and transmission11. Fourth, non-pharmaceutical interventions to contain COVID-19 in China since January 23, 2020 significantly reduced the country-wide disease duration and outdoor transmission12; the environmental impact on transmission may have been eclipsed as a consequence. Finally, differences in reporting practices across regions may complicate any efforts to compare relationships between weather and transmission from one location to another.

Despite these challenges and inconsistent conclusions from research on this topic to date, it is important to propose alternative methodologies that provide a complementary understanding of the effects of environmental factors on the ongoing outbreak to support decision-making pertaining to disease control. This is especially true for locations where the risk of transmission may have been underestimated, such as humid and warm places.

Our contribution

Here, we propose a methodology that can be implemented in real-time during the early phase of an outbreak to examine variability in environmental factors, mobility, and transmission of COVID-19 across provinces and cities in China. We show that the observed spatial patterns of COVID-19 transmission are not explained by ambient temperature, absolute humidity or human mobility alone. Our findings do not support the hypothesis that high absolute humidity in warmer environments may limit the survival and transmission of this new virus.

Data and methods

Epidemiological data

To conduct our analysis, we collected epidemiological data from the Johns Hopkins Center for Systems Science and Engineering website13. Incidence data were collected from various sources, including the World Health Organization (WHO); U.S. Centers for Disease Control and Prevention (CDC); China CDC; European CDC; the Chinese National Health Commission (NHC); as well as DXY, a Chinese website that aggregates NHC and local China CDC situation reports in near real-time. Daily cumulative confirmed incidence data were collected for each province in China from January 22, 2020 to February 26, 2020. We also obtained epidemiological data for 345 cities in China and for other affected countries, including Iran, Italy, Singapore, Japan, and South Korea.

Estimation of a proxy for the reproductive number

Based on the cumulative incidence data for each province, city or country, we estimated a proxy for the reproductive number R in a collection of 5-, 6- and 7-day intervals14. R is a measure of potential disease transmissibility, defined as the average number of people a case infects before recovering or dying. Our proxy for R, designated as Rproxy, is a constant that maps cases occurring from time t to time t + d onto cases reported from time t + d to time t + 2d, where d is an approximation of the serial interval (i.e., the number of days between successive cases in a chain of disease transmission). For multiple time points, t, we obtained values of Rproxy(t,d), given by:

$${R}_{proxy}\left(t,d\right)=\frac{C\left(t+2d\right)-C\left(t+d\right)}{C\left(t+d\right)-C\left(t\right)}$$

where C(t) is the cumulative case count up to time t, and d takes the values 5, 6 and 7 days. Our measure is considered only a proxy for R because it does not use details of the serial interval distribution (which is currently imprecisely defined), but instead simply calculates the multiplicative increase in the number of incident cases over approximately one serial interval. Such proxies15 are at least approximately monotonically related to the true reproductive number and cross 1 when the true reproductive number crosses 1; i.e., increases in our proxy typically signal increases in R. After computing these proxy values over a series of successive moving time windows for each serial interval (5, 6 and 7 days), a mean value was obtained and used as our estimated reproductive number R for each province, city, and country.
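The proxy defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the windows are assumed to advance one day at a time, and windows with no new cases in the denominator are treated as undefined (NaN) and skipped when averaging.

```python
from statistics import mean


def r_proxy(cumulative, t, d):
    """New cases in (t+d, t+2d] divided by new cases in (t, t+d]."""
    later = cumulative[t + 2 * d] - cumulative[t + d]
    earlier = cumulative[t + d] - cumulative[t]
    if earlier <= 0:
        # Undefined when no cases were reported in the first window.
        return float("nan")
    return later / earlier


def mean_r_proxy(cumulative, intervals=(5, 6, 7)):
    """Average r_proxy over every valid start day t and every interval d.

    Assumption: windows slide by one day; the source only says that
    successive moving windows were used.
    """
    values = []
    for d in intervals:
        for t in range(len(cumulative) - 2 * d):
            value = r_proxy(cumulative, t, d)
            if value == value:  # NaN is the only value not equal to itself
                values.append(value)
    return mean(values) if values else float("nan")


# Made-up series growing by exactly one case per day: every ratio is 1.
print(mean_r_proxy(list(range(30))))  # 1.0
```

A series with no new cases at all yields NaN, which mirrors the invalid values reported for locations with very few cases.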

Time windows

Our study covered January 22, 2020 to February 26, 2020 to ensure that there was COVID-19 activity across all the locations. Indeed, the main outbreaks in Chinese provinces took place from the beginning of January to the end of February. In addition, to characterize the temporal evolution of the COVID-19 outbreak (a large decrease in transmission after the closure of Wuhan and a subsequent flattening of the epidemic curve), the reproductive number proxy Rproxy was calculated for two different time periods. The first, τ1, ran from January 22, 2020 to February 8, 2020, and the second, τ2, from February 9, 2020 to February 26, 2020. In our study, the reproductive numbers computed for the first and second time periods are labeled R0τ1 and R0τ2, respectively.
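As a small illustration of the two periods, the sketch below splits a daily series into τ1 and τ2. The function and constant names are hypothetical, and the hard split at the period boundary is an assumption: the text does not say whether a window used for Rproxy may extend across it.

```python
from datetime import date

STUDY_START = date(2020, 1, 22)
TAU1_END = date(2020, 2, 8)
STUDY_END = date(2020, 2, 26)


def split_periods(daily_cumulative):
    """Split daily values (first entry = STUDY_START) into tau1 and tau2."""
    n_tau1 = (TAU1_END - STUDY_START).days + 1    # days in tau1, inclusive
    n_total = (STUDY_END - STUDY_START).days + 1  # days in the whole study
    if len(daily_cumulative) != n_total:
        raise ValueError(
            f"expected {n_total} daily values, got {len(daily_cumulative)}"
        )
    return daily_cumulative[:n_tau1], daily_cumulative[n_tau1:]
```

With the dates given above, each period contains 18 daily values.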

Weather data

All meteorological data for this study were taken from the ERA5 reanalysis, a state-of-the-art data product produced at the European Centre for Medium-Range Weather Forecasts16,17. ERA5 is generated by using a vast range of meteorological observations to constrain a physics-based numerical weather prediction model. This procedure, referred to by atmospheric scientists as data assimilation, yields a globally complete gridded data set including many different meteorological variables. Time resolution of ERA5 is quite high (1 h) and it is also frequently updated (preliminary ERA5 data are available 5 days behind real time), making it useful for studies of rapidly evolving disease outbreaks18. Furthermore, a conceptually similar but much less sophisticated data product (the National Centers for Environmental Prediction-National Center for Atmospheric Research reanalysis19) has been found useful for studies of influenza epidemics5.

We obtained relevant ERA5 data at a spatial resolution of 0.25° (~ 28 km at the equator). We represented weather conditions in each city of interest by those in the ERA5 grid box containing the city. Because we assumed that the majority of disease incidence for each province occurs in or near the capital due to increased population density in these areas, we chose to represent each province’s weather conditions by those in the ERA5 grid box containing the provincial capital. Near-surface air temperature, used in this study, is one of the standard ERA5 variables. Absolute humidity (more specifically, near-surface water vapor density) is not one of the standard ERA5 output variables. Instead, it must be computed from variables that are available, namely near-surface air temperature (T2) and near-surface dew point temperature (Td) (see supplementary material for more details). We produced hourly time series of temperature and humidity and then computed time mean absolute humidities and temperatures over January 17–31, 2020 and February 1–15, 2020, for comparison to τ1 and τ2 Rproxy data, respectively.
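One common way to derive water vapor density from these two variables is sketched below. It is an assumption for illustration, since the supplementary material is not reproduced here: the vapor pressure is taken as the saturation vapor pressure at the dew point using the Bolton (1980) approximation, and the density follows from the ideal gas law for water vapor. Function names are hypothetical.

```python
import math

R_V = 461.5  # specific gas constant of water vapor, J kg^-1 K^-1


def vapor_pressure_pa(dew_point_k):
    """Vapor pressure in Pa from dew point in K (Bolton 1980 approximation)."""
    td_c = dew_point_k - 273.15
    return 611.2 * math.exp(17.67 * td_c / (td_c + 243.5))


def absolute_humidity_g_m3(t2_k, td_k):
    """Water vapor density in g m^-3 from 2 m temperature and dew point (K)."""
    e = vapor_pressure_pa(td_k)
    # Ideal gas law for water vapor: rho_v = e / (R_v * T), converted to grams.
    return 1000.0 * e / (R_V * t2_k)


# Saturated air at 0 degrees C holds a little under 5 g of water per cubic meter.
print(round(absolute_humidity_g_m3(273.15, 273.15), 2))
```

Applying this to each hourly pair of values and then averaging over the period gives a time-mean absolute humidity of the kind described above.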

Human mobility data

We obtained mobility data made publicly available by the Chinese Internet search engine Baidu20. From the full origin–destination matrix for each day, we derived the percentage of people traveling from Wuhan to each of the Chinese provinces from January 1, 2020 to January 22, 2020 (i.e., before the mandated lockdown in Wuhan).
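The aggregation described above might look like the following sketch. The layout of the Baidu data is not described in the text, so the input structure here (one dictionary per day, keyed by origin and destination) is hypothetical, as is the choice to sum volumes over all days before taking percentages.

```python
from collections import defaultdict


def outflow_share_percent(daily_flows, origin="Wuhan"):
    """Percentage of travelers leaving `origin` that went to each destination.

    daily_flows: iterable of dicts mapping (origin, destination) to a
    traveler volume for one day (hypothetical format).
    """
    totals = defaultdict(float)
    for flows in daily_flows:
        for (src, dst), volume in flows.items():
            if src == origin and dst != origin:
                totals[dst] += volume
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {dst: 100.0 * v / grand_total for dst, v in totals.items()}


# Made-up two-day example with two destinations.
days = [
    {("Wuhan", "A"): 30, ("Wuhan", "B"): 10, ("A", "B"): 5},
    {("Wuhan", "A"): 30, ("Wuhan", "B"): 30},
]
print(outflow_share_percent(days))  # {'A': 60.0, 'B': 40.0}
```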

Data analysis

Given the potential noise contained in the reported case counts, we tested the robustness of our findings by gradually removing provinces and cities whose data were deemed too noisy or were missing. This was done in three successive filtering steps. First, we included all provinces and cities where Rproxy could be properly calculated (i.e., enough cases were reported). Second, we removed provinces where mobility data were not available. Finally, we removed provinces and cities where the values of Rproxy were unrealistically high (perhaps due to reporting biases), specifically above 3. This last filter was used to remove potentially noisy values that would affect our analysis, and it reflects the World Health Organization's estimate that R values range from 2 to 2.5. For country-level transmission, we did not conduct any statistical analysis due to the extremely noisy values of Rproxy.
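The three filtering steps can be expressed as successive list filters. The record layout below is hypothetical, and the sketch applies the mobility filter to every record, whereas in the study that step concerns provinces only.

```python
import math


def apply_filters(records, r_max=3.0):
    """Return the records kept after each of the three filtering steps.

    records: list of dicts with keys 'name', 'r_proxy' and 'mobility'
    (hypothetical layout; 'mobility' is None when unavailable).
    """
    # Step 1: keep locations where the proxy could be calculated.
    step1 = [r for r in records if not math.isnan(r["r_proxy"])]
    # Step 2: drop locations without mobility data.
    step2 = [r for r in step1 if r["mobility"] is not None]
    # Step 3: drop unrealistically high proxy values (above r_max).
    step3 = [r for r in step2 if r["r_proxy"] <= r_max]
    return step1, step2, step3


# Made-up records for illustration only.
example = [
    {"name": "A", "r_proxy": 1.5, "mobility": 2.0},
    {"name": "B", "r_proxy": float("nan"), "mobility": 1.0},
    {"name": "C", "r_proxy": 2.0, "mobility": None},
    {"name": "D", "r_proxy": 3.9, "mobility": 0.5},
]
kept = apply_filters(example)
print([[r["name"] for r in step] for step in kept])
```

Returning all three intermediate lists makes it easy to rerun the same regression after each step, which is how the robustness of each association is assessed.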

Human mobility as a predictor of the reproductive number

To determine whether our reproductive number estimates could be explained by importation of cases from Wuhan, Hubei, alone, or whether they could be interpreted as indicators of local transmission, we formulated a linear model with the local Rproxy as the response variable and human mobility as a predictor at the province level. Specifically, we used mobility data before the closure of Wuhan (i.e., from January 1, 2020 to January 22, 2020) to explain \(\mathrm{R}{0}_{\uptau_1}\):

$$\mathrm{R}{0}_{\uptau_1}\left(j\right)={\upbeta }_{0}+{\upbeta }_{1}{X}_{mobility}\left(j\right)+\upepsilon \left(j\right)$$

where \(\mathrm{R}{0}_{\uptau_1}\)(j) is the proxy for the reproductive number for province j during the two weeks immediately after Wuhan's lockdown; \({\mathrm{X}}_{\mathrm{mobility}}\)(j) is the percentage of people traveling from Wuhan to province j; and \(\upepsilon\)(j) \(\sim \mathcal{N}\left(0,\hspace{0.17em}1\right)\) are the residuals of the regression.
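A single-predictor linear model of this form can be fitted by ordinary least squares. The sketch below uses only the Python standard library and made-up numbers; the function name is hypothetical and it is not the authors' implementation. It returns the slope, intercept, R2 and the t statistic of the slope; converting that statistic to a p value requires the Student t distribution with n − 2 degrees of freedom, which is left to a statistics package.

```python
import math
from statistics import mean


def ols_fit(x, y):
    """Least-squares fit of y = b0 + b1 * x for one predictor."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    slope_se = math.sqrt(sse / (n - 2) / sxx)  # standard error of the slope
    t_stat = slope / slope_se if slope_se > 0 else math.inf
    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": 1.0 - sse / sst,
        "t_stat": t_stat,
    }


# Made-up mobility percentages and proxy values, for illustration only.
fit = ols_fit([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
print(round(fit["slope"], 2), round(fit["intercept"], 2))
```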

Relationship between reproductive number and temperature

We used a Loess regression to visually represent the relationship between the reproductive number for each province and temperature (Fig. 1). To assess the statistical significance of this relationship, we implemented a linear model using the log of the local reproductive number Rproxy as our response variable and temperature as predictor; the log transformation was employed to improve Gaussianity (Supplementary Figure S1).

Figure 1

Visualization of the relationship between COVID-19 transmission as captured by Rproxy and temperature and humidity. The data points on the scatter plot represent the value of Rproxy (with its associated 87% confidence intervals displayed as vertical lines, obtained from the collection of Rproxy calculated in subsequent time windows of length d for each location) as a function of temperature and humidity. The black line corresponds to a Loess regression aimed at capturing the relationship between Rproxy and temperature and humidity. In addition, the color intensity (orange) of each data point shows the size of the outbreak in each location, as captured by the log of cumulative case counts.

The linear model was computed for both time periods described above:

$$\mathrm{log}\left({\mathrm{R}}_{\mathrm{proxy}}\left(\mathrm{j}\right)\right)={\beta }_{0}^{\prime}+{\beta }_{2}{\mathrm{X}}_{\mathrm{temperature}}\left(\mathrm{j}\right)+{\epsilon }^{\prime}\left(\mathrm{j}\right)$$

Depending on the time period explained, Rproxy corresponds to \(\mathrm{R}{0}_{\uptau_1}\) or \(\mathrm{R}{0}_{\uptau_2}\) at the province and city levels; \({\mathrm{X}}_{\mathrm{temperature}}\) corresponds to the temperature for the first or second time period.
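To help read the temperature coefficient in this log-linear model, it can be useful to exponentiate both sides. Setting the error term aside, the model implies

$$R_{\mathrm{proxy}}(j)=e^{\beta'_{0}}\,e^{\beta_{2}X_{\mathrm{temperature}}(j)}$$

so each one-unit increase in temperature multiplies Rproxy by a factor of \(e^{\beta_{2}}\). As a purely hypothetical illustration (not a fitted value from this study), \(\beta_{2}=-0.02\) would give \(e^{-0.02}\approx 0.980\), i.e. an Rproxy about 2% lower per unit of temperature.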

Relationship between reproductive number and absolute humidity

As for temperature, we conducted the same analysis for absolute humidity. The linear model was:

$$\mathrm{log}\left({\mathrm{R}}_{\mathrm{proxy}}\left(\mathrm{j}\right)\right)={\beta }_{0}^{\prime\prime}+{\beta }_{3}{\mathrm{X}}_{\mathrm{abs\ humidity}}\left(\mathrm{j}\right)+{\epsilon }^{\prime\prime}\left(\mathrm{j}\right)$$

where \({\mathrm{X}}_{\mathrm{abs\ humidity}}\) corresponds to the absolute humidity for the first or second time period.
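For readers unfamiliar with the Loess curves used in the temperature and humidity visualizations, the sketch below shows one simple form of the technique: a locally weighted linear fit with tricube weights, evaluated at a single point. The smoothing fraction of 0.75 and the function name are assumptions, since the text does not state the settings used for the figures.

```python
import math


def loess_at(x0, xs, ys, frac=0.75):
    """Locally weighted linear regression (tricube weights) evaluated at x0."""
    n = len(xs)
    k = min(n, max(2, math.ceil(frac * n)))  # number of neighbors in the window
    distances = sorted(abs(x - x0) for x in xs)
    bandwidth = distances[k - 1] or 1e-12
    weights = []
    for x in xs:
        u = min(abs(x - x0) / bandwidth, 1.0)
        weights.append((1.0 - u ** 3) ** 3)
    if sum(weights) == 0:
        # Degenerate case: fall back to equal weights inside the window.
        weights = [1.0 if abs(x - x0) <= bandwidth else 0.0 for x in xs]
    total = sum(weights)
    x_mean = sum(w * x for w, x in zip(weights, xs)) / total
    y_mean = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - x_mean) * (y - y_mean)
              for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return y_mean + slope * (x0 - x_mean)


# On exactly linear made-up data the local fit reproduces the line.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
print(loess_at(4.5, xs, ys))
```

Evaluating the function on a grid of temperature or humidity values traces out a smooth curve like the black line in the scatter plots.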

Results

Reproductive number proxy

In both time periods, \(\uptau_1\) and \(\uptau_2\), our estimates of Rproxy for each province within China appeared to be consistent across the range of serial intervals we analyzed (Fig. 1). In the first time period, most regions had an Rproxy estimate well above 1, signaling sustained disease transmission. Rproxy estimates across provinces decreased dramatically in the second time period, with many falling below 1, likely in response to the multiple non-pharmaceutical interventions implemented by Chinese authorities.

Data analysis (filtering)

In the first step of our analysis, the provinces of Tibet, Qinghai and Macau were removed due to the low number of reported COVID-19 cases there. Low case numbers (and multiple zeros) led to invalid calculations (NaN) of Rproxy. In the second step, we removed 3 provinces for which no mobility data were available: Tibet, Hong Kong and Inner Mongolia. Finally, 5 provinces were removed because of the unrealistically high values of their Rproxy: Guizhou, Hubei, Heilongjiang, Jilin and Shandong (3.92, 3.19, 3.32, 3.57, and 4.45, respectively). At the city level, 175 cities were removed due to the low number of cases (first filter) and 23 cities were removed because of the high value of their Rproxy (third filter). Finally, the values of Rproxy for countries are shown for reference: Iran (\(\mathrm{R}{0}_{\uptau_1}=0\) and \(\mathrm{R}{0}_{\uptau_2}=34.00\)), Italy (\(\mathrm{R}{0}_{\uptau_1}=0\) and \(\mathrm{R}{0}_{\uptau_2}=107.2\)), Singapore (\(\mathrm{R}{0}_{\uptau_1}=1.85\) and \(\mathrm{R}{0}_{\uptau_2}=0.39\)), Japan (\(\mathrm{R}{0}_{\uptau_1}=1.84\) and \(\mathrm{R}{0}_{\uptau_2}=2.70\)), and South Korea (\(\mathrm{R}{0}_{\uptau_1}=3.11\) and \(\mathrm{R}{0}_{\uptau_2}=196.97\)).

Relationship with mobility

Because Wuhan (provincial capital of Hubei) was the origin of the COVID-19 outbreak, and exported cases could only be calculated for the rest of the provinces, we excluded Hubei from our mobility analysis. As shown in Tables 1 and 2, an influence of mobility on Rproxy could only be identified after the third step of filtering. After the second step of filtering, human mobility (prior to Wuhan's lockdown) did not appear to be associated with Rproxy across Chinese provinces during time period \(\uptau_1\) (p value = 0.93). However, in the same time period, once we excluded Rproxy values above 3 (third step of filtering), mobility was found to be associated with Rproxy (p value = 0.01).

Table 1 Relationship between the reproductive number for the first time period, \(\mathrm{R}{0}_{\uptau_1}\), and mobility with the second step of filtering.
Table 2 Relationship between the reproductive number for the first time period, \(\mathrm{R}{0}_{\uptau_1}\), and mobility with the third step of filtering.

Relationship with temperature

Figure 1 visualizes the relationship between COVID-19 transmission, as captured by Rproxy, and temperature and humidity. The data points on the scatter plot represent the value of Rproxy (with its associated confidence interval) as a function of temperature and humidity. The black line corresponds to a Loess regression aimed at capturing the relationship between Rproxy and temperature and humidity. Specifically, for the first time period, higher temperatures appear to be associated with lower rates of transmission. In addition, the color intensity (orange) of each data point shows the size of the outbreak in each location, as captured by the log of cumulative case counts.

In the linear regression models, after the first step of filtering and for time period \(\uptau_1\), temperature appeared to be associated with Rproxy at the 94% confidence level (Table 3). Specifically, temperature showed a negative relationship, indicating that locations with higher temperatures appeared to have lower transmission (Fig. 2). After the two additional steps of filtering, the association between temperature and Rproxy became weaker or non-significant (with p values equal to 0.111 and 0.857, respectively; Tables 4 and 5). Weak to non-significant associations were observed when we conducted our analysis for the second time period \(\uptau_2\), with p values ranging from 0.118 to 0.700 (Tables 6, 7, 8). At the city level in China, temperature appeared to be associated with Rproxy for the first time period after removing cities with a low number of cases (p value = 0.01; Supplementary Table S1). After removing Rproxy values above 3, temperature was no longer associated with Rproxy, with a p value equal to 0.83 (Supplementary Table S2). No associations were observed in the city-level analysis for the second time period, with p values equal to 0.32 and 0.23 after the two steps of filtering (Supplementary Tables S3, S4).

Table 3 Relationship between \(\mathrm{log}(\mathrm{R}{0}_{\uptau_1})\) and temperature with the first step of filtering.
Figure 2

Temperature in each provincial capital vs. COVID-19 Rproxy estimate (calculated for the first time period). The size and color of each pin indicate cumulative cases per province and Rproxy range, respectively. (Map obtained with ArcMap, https://desktop.arcgis.com/en/arcmap/ version 10.2).

Table 4 Relationship between \(\mathrm{log}(\mathrm{R}{0}_{\uptau_1})\) and temperature with the second step of filtering.
Table 5 Relationship between \(\mathrm{log}(\mathrm{R}{0}_{\uptau_1})\) and temperature with the third step of filtering.
Table 6 Relationship between \(\mathrm{log}(\mathrm{R}{0}_{\uptau_2})\) and temperature with the first step of filtering.
Table 7 Relationship between \(\mathrm{log}(\mathrm{R}{0}_{\uptau_2})\) and temperature with the second step of filtering.
Table 8 Relationship between \(\mathrm{log}(\mathrm{R}{0}_{\uptau_2})\) and temperature with the third step of filtering.

Relationship with absolute humidity

In all steps of filtering at the province level, and for both time periods, \(\uptau_1\) and \(\uptau_2\), absolute humidity was not associated with Rproxy, with p values ranging between 0.161 and 0.922 (Tables 9–15). This can also be observed in Fig. 1, where the black curve (corresponding to the Loess regression) is relatively flat. Meanwhile, Fig. 3 allows us to visualize the values of Rproxy and humidity across regions. For cities, for time period \(\uptau_1\) and after the first step of filtering, absolute humidity appeared to be associated with Rproxy, with a p value equal to 0.004 (Supplementary Table S5). Specifically, absolute humidity showed a negative relationship, indicating that locations with higher absolute humidity experienced lower transmission. Nevertheless, after the third step of filtering, absolute humidity was not found to be associated with Rproxy, with a p value equal to 0.64 (Supplementary Table S6). For the second time period \(\uptau_2\), no associations were found either, with p values equal to 0.95 and 0.87 after the two steps of filtering, respectively (Supplementary Tables S7, S8).

Table 9 Relationship between \(\mathrm{log}(\mathrm{R}{0}_{\uptau_1})\) and absolute humidity with the first step of filtering.
Table 10 Relationship between \(\mathrm{log}(\mathrm{R}{0}_{\uptau_1})\) and absolute humidity with the second step of filtering.
Table 11 Relationship between \(\mathrm{log}(\mathrm{R}{0}_{\uptau_1})\) and absolute humidity with the third step of filtering.
Table 12 Relationship between \(\mathrm{log}(\mathrm{R}{0}_{\uptau_2})\) and absolute humidity with the first step of filtering.
Table 13 Relationship between \(\mathrm{log}(\mathrm{R}{0}_{\uptau_2})\) and absolute humidity with the second step of filtering.
Table 14 Relationship between \(\mathrm{log}(\mathrm{R}{0}_{\uptau_2})\) and absolute humidity with the third step of filtering.
Table 15 Summary of the principal results (P value, R2) of the linear regressions.
Figure 3

Absolute humidity in each provincial capital vs. Rproxy estimate (calculated for the first time period). The size and color of each pin indicate cumulative cases per province and Rproxy range, respectively. (Map obtained with ArcMap, https://desktop.arcgis.com/en/arcmap/ version 10.2).

Discussion

Ambient temperature appears to be associated with COVID-19 transmission (as captured by our proxy of R) during the first time period (January 22, 2020–February 8, 2020) at both spatial resolutions and in the absence of any data filtering. Specifically, temperature showed a negative relationship, indicating that locations with higher temperatures appeared to have lower COVID-19 transmission. These results were not robust to filtering techniques aimed at removing noisy values, such as unrealistically high values of Rproxy (more than 3). In an effort to identify whether transmission rates could be explained by the rate of case importations at the province level, we analyzed whether mobility from Wuhan to each province could explain the spatial variability of Rproxy during the first time period. Our results showed no association between mobility and Rproxy in the absence of data filtering, but showed that Rproxy could be explained by mobility when values of Rproxy larger than 3 were removed. Finally, our analysis suggests that absolute humidity was not robustly associated with Rproxy, but these results need to be interpreted carefully given the monotonic functional relationship between humidity and temperature (Clausius–Clapeyron relation). In other words, if temperature were associated with COVID-19 transmission, absolute humidity would very likely play a role as well.
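The link between the two variables can be made explicit. As a sketch under the usual approximation of a constant latent heat of vaporization \(L_v\) (the symbols here are introduced for illustration and do not appear elsewhere in the text), the Clausius–Clapeyron relation gives the saturation vapor pressure \(e_s\) as an increasing function of temperature T, and the ideal gas law then gives the absolute humidity \(\rho_v\) at a relative humidity RH (expressed as a fraction):

$$e_{s}(T)\approx e_{s}(T_{0})\exp\left[\frac{L_{v}}{R_{v}}\left(\frac{1}{T_{0}}-\frac{1}{T}\right)\right],\qquad \rho_{v}=\mathrm{RH}\,\frac{e_{s}(T)}{R_{v}T}$$

where \(R_v\) is the specific gas constant of water vapor and \(T_0\) is a reference temperature. At a given relative humidity, warmer air therefore carries more water vapor, which is why temperature and absolute humidity tend to vary together across locations.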

Limitations

Our estimates of the observed Rproxy across locations were calculated using available and likely incomplete reported case count data, with date of reporting rather than date of onset, which adds noise to the estimation. In addition, the relatively short duration of the current outbreak, combined with imperfect daily reporting practices, makes our results vulnerable to change as more data become available. We have assumed that travel limitations and other containment interventions have been implemented consistently across provinces and have had similar impacts (thus population mixing and contact rates are assumed to be comparable), and have ignored the fact that different places may have different reporting practices. Further improvements could incorporate data augmentation techniques that may be able to produce historical time series with likely estimates of case counts based on onset of disease rather than reporting dates. This, along with more detailed estimates of the serial interval distribution, could yield more realistic estimates of R. In addition, while the low R2 values from our models show that no individual variable is enough to explain the variability of the COVID-19 transmission rate, we considered that finding statistically significant relationships could help us achieve our goal. In fact, if the goal were to design a model to explain the variance of Rt, one would likely require more input variables, for example the population density in each area, people’s behavior (such as mask-wearing adoption) or socioeconomic factors. Future studies should incorporate all these variables to further characterize transmission. Finally, further experimental work needs to be conducted to better understand the mechanisms of transmission for COVID-19. Mechanistic understanding of transmission could lead to a coherent justification of our findings.

Conclusion

Despite the above limitations, our early and near-real-time analysis of the impact of environmental factors on COVID-19 transmission in China could offer useful insights for policymakers and the public worldwide. Sustained transmission and rapid growth of cases were observed across a wide range of temperature and humidity conditions, from cold and dry provinces in China, such as Jilin and Heilongjiang, to tropical locations, such as Guangxi and Taiwan, during the first time period (τ1, from January 22 to February 8, 2020). Our results show that weather alone cannot explain, in a robust way, the variability of the reproductive number in Chinese provinces or cities. Moreover, drastic reductions in transmission were observed during the second half of February, likely due to the strict non-pharmaceutical interventions imposed across China. In addition, these findings have been borne out over the past few months. Further studies on the effects of environmental factors on COVID-19 will be possible as more data are collected in multiple affected geographies during this COVID-19 outbreak.