Geographic social inequalities in information-seeking response to the COVID-19 pandemic in China: longitudinal analysis of Baidu Index

The outbreak of the COVID-19 pandemic alarmed the public and initiated the uptake of preventive measures. However, the manner in which the public responded to official announcements, and whether individuals from different provinces responded similarly during the COVID-19 pandemic in China, remains largely unknown. We used an interrupted time-series analysis to examine the change in the Baidu Search Index of selected COVID-19 related terms associated with COVID-19 derived exposure variables. We analyzed the daily search index in Mainland China using segmented log-normal regressions with data from January 2017 to March 2021. In this longitudinal study of nearly one billion internet users, we found synchronous increases in COVID-19 related searches during the first wave of the COVID-19 pandemic and subsequent local outbreaks, irrespective of the location and severity of each outbreak. The most precipitous increase occurred in the week when most provinces activated their highest level of response to public health emergencies. Search interest increased more in provinces with a higher Human Development Index (HDI), an area-level measure of socioeconomic status. The search index began to decline nationwide after the initiation of mass-scale lockdowns, but statistically significant increases continued to occur in conjunction with reports of major sporadic local outbreaks. The intense interest in COVID-19 related information at virtually the same time across different provinces suggests that the Chinese government used multiple channels to keep the public informed of the pandemic. Regional socioeconomic status influenced search patterns.

Similarly, there was a 107-fold, 125-fold and 125-fold increase in the search index between January 18 and January 25 2020, the period spanning the official announcement of human-to-human transmission (HHT), among regions with low (RR = 106.8, 95% CI 100.1-114.0, p < 0.0001), middle (RR = 124.6, 95% CI 117.6-131.9, p < 0.0001) and high (RR = 125.3, 95% CI 116.5-134.8, p < 0.0001) HDI, respectively. The immediate increase in this short period among middle and high HDI regions was statistically significantly higher than the increase in low HDI regions (middle vs. low, ratio of RR = 1.16, p = 0.0004; high vs. low, ratio of RR = 1.17, p = 0.0012). From the peak of the search index on January 25 to June 10 2020, a 10%, 11% and 11% decrease per week was observed in the search index among regions with low (RR = 0.90, 95% CI 0.89-0.90, p < 0.0001), middle (RR = 0.89, 95% CI 0.88-0.89, p < 0.0001) and high (RR = 0.89, 95% CI 0.89-0.90, p < 0.0001) HDI, respectively.
Tibet  63  7  1386  983  2099
Yunnan  290  79  7939  7612  2642
Guizhou  224  87  7011  6531  3030
Gansu  175  74  5304  6348  2931
Qinghai  73  33  2465  1625  3277
Xinjiang  176  56  6108  7744  3370
Guangxi  303  119  7548  9863  2391
Sichuan  536  167  15,
Figure 2 illustrates the heterogeneity in the immediate relative change in the search index following each pre-specified exposure across the country.
Association between HDI, GNP per person, education, life expectancy and magnitude of change in the search index. The results from models where HDI or one of its components was coded as a continuous variable were consistent with findings from our main analysis. As shown in Table S1, the pre-pandemic trends in two provinces differing in HDI, gross national product per person (GNPPP), years of education or life expectancy by one standard deviation were similar (p > 0.1).
The immediate relative increase in the search index in a province with one standard deviation higher HDI was statistically significantly higher (initial wave: ratio of RR = 1.09, p < 0.0001; HHT announcement: ratio of RR = 1.04, p = 0.0395; Beijing outbreak: ratio of RR = 1.06, p = 0.0090; Qingdao outbreak: ratio of RR = 1.04, p = 0.0324; Shijiazhuang outbreak: ratio of RR = 1.11, p < 0.0001). In contrast, the gradual decrease in the search index in a province with one standard deviation higher HDI after each exposure was either similar or greater. For each exposure, the difference associated with GNPPP, years of education or life expectancy in the direction and magnitude of both immediate and gradual effects across provinces was similar to the difference associated with HDI.
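The rate ratios above can be read with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: it takes the point estimates quoted in the text, and the helper names (`ratio_of_rr`, `percent_change`, `cumulative_rr`) are hypothetical. Note that dividing the rounded point estimates gives about 1.17 for middle vs. low HDI, whereas the model-based ratio reported in the text is 1.16; small differences of this kind are expected because the reported ratios come from the fitted interaction terms rather than from rounded values.

```python
from datetime import date

# Point estimates quoted in the text (immediate change, 18-25 January 2020).
RR_IMMEDIATE = {"low": 106.8, "middle": 124.6, "high": 125.3}
# Weekly rate ratios quoted in the text (25 January to 10 June 2020).
RR_WEEKLY = {"low": 0.90, "middle": 0.89, "high": 0.89}


def ratio_of_rr(rr_group: float, rr_reference: float) -> float:
    """Ratio of two rate ratios, e.g. middle-HDI vs. low-HDI regions."""
    return rr_group / rr_reference


def percent_change(rr: float) -> float:
    """Percent change implied by a rate ratio (0.90 -> -10%)."""
    return (rr - 1.0) * 100.0


def cumulative_rr(rr_week: float, start: date, end: date) -> float:
    """Compound a weekly rate ratio over the period from start to end."""
    weeks = (end - start).days / 7.0
    return rr_week ** weeks


if __name__ == "__main__":
    for group in ("middle", "high"):
        r = ratio_of_rr(RR_IMMEDIATE[group], RR_IMMEDIATE["low"])
        print(f"{group} vs. low, ratio of point estimates: {r:.3f}")
    for group, rr in RR_WEEKLY.items():
        total = cumulative_rr(rr, date(2020, 1, 25), date(2020, 6, 10))
        print(f"{group}: {percent_change(rr):.0f}% per week, "
              f"cumulative ratio {total:.3f}")
```

Compounding shows why a weekly decline of 10-11% is substantial: over the roughly 20 weeks between the peak and June 10, the index falls to a small fraction of its peak value under this simple extrapolation.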

Discussion
This study used province-level Baidu search indices for COVID-19 related terms to measure search volume among Chinese internet users, as a proxy for public awareness of COVID-19, and to compare awareness and proactive information-seeking across regions. We found that, in January 2020, the outbreak in Wuhan triggered an increase in searches for COVID-19 related terms among internet users in different regions. The increase was sharpest between January 18 and 25, 2020, a period when mass media (e.g. television, radio, newspapers and online media) reported the confirmation of human-to-human transmission of SARS-CoV-2, greatly increasing public awareness of the threat of the disease. This was reflected in a huge increase in search indices. Each subsequent outbreak in China also reignited public interest in COVID-19, increasing the search volume for COVID-19 related keywords. However, these later increases did not surpass the first search index apex, which may be explained by individuals having already accumulated prior knowledge and becoming more accustomed to COVID-19 outbreaks, as well as by the fact that subsequent outbreaks were less severe. When the Wuhan municipal government issued a notification about an unknown respiratory syndrome at the end of December 2019, the public response was reminiscent of the fear caused by SARS in 2003, especially as little was known about this new pneumonia. On January 20, 2020, the confirmation of human-to-human transmission of COVID-19 was announced via mass media.
After being informed of their susceptibility to COVID-19, the public across China rushed to seek related information online 24,25. These increases happened in just 3 days, from 20 to 23 January. In contrast, global collective public attention to COVID-19 reached its peak on 12 March, following the World Health Organization's declaration of a pandemic 26,27. The surge of collective public attention to COVID-19 in China during the early stage of the outbreak could be attributed to governments at all levels mobilizing the whole society to contain COVID-19 28. In addition, the first spike in search volume for COVID-19 related keywords occurred on the same day across all provinces in China, which differed from the subnational patterns in the US, where state-level search volume typically peaked when the first COVID-19 case was announced in the state 2,3,29,30.
We further found that, after the first information-seeking peak, although there was an evident decline in search interest in COVID-19 related words from February to April 2020, public concern about the COVID-19 pandemic (as reflected by search interest) remained at a high level in every province through the end of our study period. As the Chinese government implemented stringent nationwide non-pharmaceutical interventions, China succeeded in its initial containment of COVID-19, with daily new local cases under 10 from late March to late April 2020. In the context of China's zero-COVID policy and few new cases, news of any new domestically occurring cases of COVID-19 generated a relatively large amount of media attention. For example, in Beijing in June 2020, a sporadic outbreak of 335 new COVID-19 cases (and no deaths) occurred due to imported frozen products 31.
We found social inequalities in the intensity of information-seeking behavior within China. Studies have confirmed that deprived populations show relatively lower awareness of infectious diseases, including H1N1 and COVID-19 10,32,33. In our findings, these inequalities are evidenced by the absolute change in Baidu Index volume as well as the speed with which peak search volume was reached. Populations in areas with higher human development showed a higher volume of COVID-19 related searches, and their searches increased faster and declined more slowly, which suggests that populations living in high-HDI areas not only responded faster to COVID-19 but also maintained a heightened, more durable awareness of the epidemic. Lower awareness of COVID-19 may result in less attention paid to personal mitigation techniques and lower compliance with non-pharmaceutical interventions 5, which together may put deprived populations at greater risk of contracting COVID-19.
Due to the lower incidence and mortality of COVID-19 in China, it is difficult to analyze how social inequalities may have affected COVID-19 infections and related health outcomes in China. However, our analysis provides some evidence of social inequalities in information-seeking reactions to and awareness of COVID-19 in China, potentially exacerbating existing inequalities in COVID-19 related physical and mental health outcomes in both the short and long term 13,34.
Our study is subject to several limitations. First, our study only attempts to use the analysis of internet users' information-seeking behavior to reflect public concern about COVID-19. Although Baidu is the most commonly used search engine in China with the highest market share, our findings cannot be generalized to people who do not have access to the Internet. Second, as a disproportionately higher fraction of individuals without access to the internet have low SES and lower levels of health literacy 35, we may have underestimated the inequalities in the information-seeking response among regions with different SES. Third, due to the lack of data, we were not able to examine the influence of mass media, which likely mediated internet searches, although the reverse is also possible (that is, internet searches could also mediate mass media exposure) 36,37. Lastly, we were not able to explore how individuals reacted to a health crisis using more disaggregated, individual-level data, such as data from surveys. We were able to examine how patterns of information-seeking responses differed according to the area-level HDI metric and used this measure to generate a hypothesis about potential associations with respect to individual factors, including education and income.
We used Baidu search data to analyze the first wave of the COVID-19 epidemic in China and several subsequent small outbreaks and found that there was an unprecedented increase in public awareness of the COVID-19 epidemic in China, and that the several subsequent outbreaks also sparked intense concern among internet users across China. Changes in the patterns of search interest in COVID-19 in each province of China were nearly synchronous during the first wave of the COVID-19 pandemic and subsequent local outbreaks, irrespective of the location of the epicenter of each outbreak and the variation in pandemic severity across the country. However, social inequalities in public response and awareness of COVID-19 were apparent, with less search interest observed in less developed areas compared with developed areas.

Materials and methods
Data. Baidu is the most popular search engine in China. The Baidu index (BI) is measured as the weighted frequency of unique searches for a search keyword or phrase relative to total search volume on Baidu on a given day 38. The provincial daily confirmed COVID-19 cases were retrieved from the official daily reports 39. The provincial-level human development index (HDI), an area-level measure of socioeconomic status, was retrieved from the China National Human Development Report 2019 to reflect regional-level SES 11,40. A key advantage of examining an area-level measure in this context is its utility in providing evidence to help guide community-level interventions and policies. Other area-level measures by province, including the Gross National Product per person (GNPPP), the average number of years of education received by people aged 25 and older, and life expectancy at birth, which were used to calculate the HDI, were extracted from the statistical yearbook and publicly available reports.
Our aim was to examine three interrelated research questions: (1) Did the COVID-19 outbreak lead to a statistically significant increase in the Baidu Index of COVID-19 related terms? (2) What was the magnitude of the increases in searches compared to pre-COVID-19 forecasted trends, and how did these increases differ across regions with different levels of socioeconomic development? (3) Did the collective attention diminish toward pre-COVID-19 levels after the pandemic apex, and how did this differ according to the human development index (HDI)?
Ethical statements. This study was exempt from institutional review oversight since the data are publicly accessible and aggregated at population level. Methods were carried out in accordance with relevant guidelines and regulations.

Statistical analysis. After the initial exploration of search indices over time, we adopted an interrupted time series design to examine the effects of COVID-19. The effect was modeled using a segmented log-normal regression parameterization [41][42][43][44] defining both pre-COVID-19 trends (January 1 2017 to December 30 2019) and distinct post-COVID-19 periods that reflected different pandemic periods as experienced within China. Due to known large provincial-level heterogeneity in baseline levels as well as long-term trends, we employed mixed-effects models with random intercepts and random slopes over time, with individual provinces representing the random effects 42. To adjust for observed seasonal and weekly cyclical patterns, we included fixed effects of month and day-of-week indicator variables in all models. The model equation estimating the daily search index was expressed as follows:
In the model, Index_it denotes the value of the search index in province i at time t. HDI_i is the HDI category (low, middle or high) for province i. β_0i represents the model intercept with both a fixed effect and province-level random effects, and β_1i represents the underlying pre-COVID-19 secular trends with both a fixed effect and province-level random effects. The five distinct indicator variables (Covid, Covid1, Covid2, Covid3 and Covid4) are used to define the exposures or intervals: 1) December 31 2019, the estimated start of the first COVID-19 wave; 2) January 18 2020 (right before the official announcement of human-to-human transmission via mass media) to January 25 2020 (shortly after the lockdown and the estimated peak of the daily search index in the initial COVID-19 wave); 3) a second outbreak in Beijing starting on June 11 2020; 4) the outbreak in Qingdao starting on October 12 2020; 5) the outbreak in Shijiazhuang starting on January 3 2021.
T is the time (days) elapsed since the start of the study, and T_1, T_2, T_3 and T_4 represent the days since the estimated peak (January 25 2020, June 17 2020, October 12 2020 and January 7 2021) of the daily search index associated with each distinct exposure, respectively. We interacted the main effect terms with strata of HDI categories, examining the extent to which the change in search index associated with each exposure differed by area-level socioeconomic status. Month and Day are individual dummy variables indexing the month of the year, using January as the reference category, and the day of the week, using Friday as the reference category, respectively. An AR(1) correlation structure was used to accommodate autocorrelation in residual errors. In order to estimate the association of each component of HDI with the change in search index, we replaced HDI_i in the equation by standardized HDI (continuous variable), GNPPP, years of education or life expectancy and repeated all the analyses.
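The variable definitions above are consistent with a model of roughly the following form. This is a reconstruction offered as a reading aid, not the authors' published equation. Only Index_it, HDI_i, β_0i, β_1i, the indicators Covid and Covid1 to Covid4, T, T_1 to T_4, Month and Day come from the text. The remaining coefficient symbols (β_2 onward, θ, γ, δ), the random-effect symbols b_0i and b_1i, and the AR(1) parameter ρ are assumptions introduced for illustration, and the product with HDI_i is shorthand for interactions with the HDI category indicators.

```latex
\begin{aligned}
\log(\mathrm{Index}_{it}) ={}& \beta_{0i} + \beta_{1i}\,T
  + \beta_{2}\,\mathrm{Covid}
  + \sum_{k=1}^{4}\bigl(\beta_{2+k}\,\mathrm{Covid}_{k} + \beta_{6+k}\,T_{k}\bigr) \\
 &+ \mathrm{HDI}_{i}\times\Bigl[\theta_{0} + \theta_{1}\,T + \theta_{2}\,\mathrm{Covid}
  + \sum_{k=1}^{4}\bigl(\theta_{2+k}\,\mathrm{Covid}_{k} + \theta_{6+k}\,T_{k}\bigr)\Bigr] \\
 &+ \sum_{m}\gamma_{m}\,\mathrm{Month}_{m} + \sum_{d}\delta_{d}\,\mathrm{Day}_{d}
  + \varepsilon_{it}, \\
\beta_{0i} ={}& \beta_{0} + b_{0i}, \qquad
\beta_{1i} = \beta_{1} + b_{1i}, \qquad
\varepsilon_{it} = \rho\,\varepsilon_{i,t-1} + u_{it}.
\end{aligned}
```

Under this form, the exponentiated coefficient of an indicator such as Covid_k is the immediate rate ratio for that exposure, the exponentiated coefficient of T_k (scaled to seven days) is the weekly rate ratio after the corresponding peak, and the exponentiated θ terms are the ratios of rate ratios between HDI strata.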
We employed a linear mixed model with logarithmic transformation of the dependent variable and a normal residual distribution 45. We used a mixed-effects log-normal model rather than a negative binomial or Poisson model for three reasons. First, attempts to run generalized linear models with a log link (e.g. Poisson and negative binomial models) failed to converge without simplifications such as the elimination of the AR(1) correlation structure in residual errors and the elimination of provincial-level random slopes. Second, mixed-effects log-normal models provided a better fit to the data patterns than fixed-effect log-normal models and generalized linear models, as judged by Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Third, there was no evidence of heteroscedasticity in the residuals or of departure from a normal distribution in the errors when using the log-normal model.
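The core of a segmented log-normal regression can be illustrated on a single synthetic series. The sketch below is a deliberate simplification and not the authors' analysis (which was run in R with province-level random effects, an AR(1) error structure, and month and day-of-week terms). It fits ordinary least squares to the log of a noise-free simulated index with one interruption, and all names and simulated values (`ols`, `design_row`, `simulate_index`, a baseline of 50, a 100-fold jump, a 10% weekly decline) are hypothetical.

```python
import math


def ols(X, y):
    """Least-squares coefficients from the normal equations (Gauss-Jordan)."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * v for r, v in zip(X, y))] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def design_row(t, start):
    """Intercept, pre-existing trend, level change, and change in trend."""
    covid = 1.0 if t >= start else 0.0   # indicator for the exposure
    t_after = float(max(0, t - start))   # days since the exposure
    return [1.0, float(t), covid, t_after]


def simulate_index(t, start):
    """Hypothetical noise-free index: 100-fold jump, then 10% weekly decline
    relative to the pre-existing trend."""
    log_index = (math.log(50.0) + 0.001 * t
                 + (math.log(100.0) if t >= start else 0.0)
                 + (math.log(0.9) / 7.0) * max(0, t - start))
    return math.exp(log_index)


def fit_segmented(days, start):
    """Regress log(index) on the segmented design and report rate ratios."""
    X = [design_row(t, start) for t in days]
    y = [math.log(simulate_index(t, start)) for t in days]
    _, _, b_level, b_slope = ols(X, y)
    return {"immediate_rr": math.exp(b_level),
            "weekly_rr": math.exp(7.0 * b_slope)}


if __name__ == "__main__":
    print(fit_segmented(range(200), start=100))
```

Because the outcome is modeled on the log scale, exponentiating a coefficient gives a multiplicative effect, which is how the fold increases and weekly percentage declines in the results are expressed.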
All analyses were conducted in R version 4.0.2 using data obtained on March 31, 2021. A two-sided alpha value of 0.05 indicated statistical significance. In order to maintain a family-wise alpha (type I error rate) at 0.05 over multiple comparisons, the Bonferroni correction was employed for predefined exposures for each of the 3 HDI categories. This defined a test-specific significance level of 0.05/(number of tests in the analysis - rank of the p value from lowest to highest + 1) 46. This study is reported as per the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines for cohort studies.
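The rank-dependent significance level described above has the form usually called the Holm step-down variant of the Bonferroni correction. The sketch below shows how such thresholds could be computed; it is an illustration under that reading, not the authors' code. The function name `holm_step_down` and the p values are hypothetical, and the stopping rule (no further rejections once one test fails) is the standard Holm convention rather than something stated in the text.

```python
def holm_step_down(p_values, alpha=0.05):
    """Return per-test thresholds and rejection decisions.

    The threshold for the test with the k-th smallest p value is
    alpha / (m - k + 1), where m is the number of tests.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by p value
    thresholds = [0.0] * m
    reject = [False] * m
    still_rejecting = True
    for rank, idx in enumerate(order, start=1):
        thresholds[idx] = alpha / (m - rank + 1)
        if still_rejecting and p_values[idx] <= thresholds[idx]:
            reject[idx] = True
        else:
            # Step-down rule: once one test fails, all larger p values fail.
            still_rejecting = False
    return thresholds, reject


if __name__ == "__main__":
    # Hypothetical p values, for illustration only.
    example = [0.001, 0.04, 0.03, 0.005]
    thresholds, reject = holm_step_down(example)
    for p, thr, rej in zip(example, thresholds, reject):
        print(f"p = {p:.3f}, threshold = {thr:.4f}, significant = {rej}")
```

Compared with dividing alpha by the total number of tests for every comparison, this sequential form is less conservative while still controlling the family-wise error rate.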

Data availability
The datasets used and/or analysed during the current study are available from the corresponding authors on reasonable request.