Introduction

Conventional epidemiological data resources such as disease registries or national health and morbidity surveys that explore temporal or geographical variations across populations often depend on large-scale community-based surveillance studies. These data are regarded as the “gold standard” for epidemiology as they can yield observations based on geographical gradients or stratifications by age or gender1. Such data resources are retrospective in nature, resource intensive and subject to lags in data availability2, which limits their use for urgent analytical inference or evidence synthesis for public health policy. Another limitation of such conventional approaches is the inability to monitor, in real time, the population’s health information-seeking behaviors (HISB) on emerging disease threats.

With the rise of Health Web 2.0, Population Health Data Science (PHDS) has emerged as an art of science that transforms real-time data into actionable knowledge that informs, influences and optimizes decision making promptly3. PHDS integrates public health medicine, robust medical statistics, health and behavioral sciences within human centered designs for knowledge integration3. The adoption of PHDS connotes the era of information overload within big health data, allowing real-time HISB analysis of stroke to be conducted through borderless internet connectivity. Within these applications, Google Trends has been regarded as the best analyzer for real-time HISB analysis4.

Digital footprints left by online internet users potentially serve as proxies for monitoring disease activities and HISB at the community level, capable of providing valuable real-time insights into temporal and spatial trends of the diseases being studied5. A large body of literature has explored online HISB for a variety of diseases. These include neurological disorders such as multiple sclerosis6 and status epilepticus7, rheumatic diseases like systemic lupus erythematosus (SLE)8, mental health conditions like suicidal thoughts9 and non-suicidal self-injury10, non-communicable diseases and risk factors such as cardiovascular disorders2, cancer11 and non-cigarette tobacco use12, and infectious diseases like AIDS13, Ebola14 and influenza15,16. The monitoring and analysis of internet data is conceptualized as infodemiology, which provides real-time data, reduces the time lag in data analysis and supports forecasting of disease patterns.

Studies conducted to date have mostly analyzed single datasets of either internet multi-timeline data or conventional surveillance data (e.g. disease registries) separately, limiting the potential to explore correlations between real-time HISB and the incidence of diseases. The epidemiological trends of stroke occurrences across populations have been influenced by geographic variations, demographics and socio-economic attributes1. These trends were speculated to be influenced by weather, temperature or seasonal variations in some studies17,18,19. The current study was the first in Asia, conducted from the Malaysian perspective, that aimed to determine the trends, correlations, and weather and geographic variations of stroke, and subsequently to yield a forecasting model of real-time HISB and stroke incidence in the country for the next 3 years.

Review of literature

Global epidemiology of stroke

Stroke is one of the leading causes of mortality and disability worldwide20,21. In 2016, stroke accounted for 116.4 million disability-adjusted life years (DALYs) and 5.5 million deaths globally22. There were approximately 80.1 million stroke cases reported in 2016 that afflicted 41.1 million women and 39 million men respectively22. Between 1990 and 2017, there was an 11.3% decrease in age-standardized stroke incidence rate worldwide (150.5 per 100,000 people in 2017)23. But this scenario was accompanied by an overall 3.1% increase in age-standardized stroke prevalence rate (1,300.6 per 100,000 people in 2017), with 33.4% decrease in age-standardized stroke mortality rate (80.5 per 100,000 people in 2017) in the same period of time23. Escalated trends of age-standardized stroke incidence rates were observed mostly in middle-income countries23. Regional differences found that the incidence of stroke was the highest in East Asia, followed by the Eastern European region and the lowest in Central Latin America24.

Stroke as a public health issue in Malaysia

The epidemiological literature on stroke in Malaysia was scarce until the implementation of the National Neurology Registry (NNEUR) of Malaysia in 200925,26. Malaysia has witnessed an escalating incidence of stroke, which is the third most common cause of mortality and the leading contributor to disability in the nation27. In 2016 alone, stroke accounted for 11,284 cases, mostly affecting men (55%) and those aged 60 years or older (60%)26. Age-standardized stroke mortality rates were 103 per 100,000 in men and 97 per 100,000 in women25. Significant functional disabilities and psychiatric morbidities posed a substantial burden on patients, caregivers, healthcare systems and providers25, thus escalating the economic burden28.

Google Trends related studies

Google Trends has been valuable for exploring trends, seasonality and correlations for a variety of neurological and non-communicable diseases. Walcott et al29 used Google data to determine the prevalence of stroke in the USA. They found that disease-specific search queries related to stroke correlated well with geographical differences across states and that the correlation model provided a metric to evaluate health disparities29. Senecal et al30 hypothesized that online symptom searches are important for the early identification of cardiovascular diseases. They found that online searches for the symptom of chest pain correlated with coronary heart disease epidemiology30. Kumar et al2 sought to determine whether temporal and geographical interest in seeking cardiovascular disease (CVD) information online would follow a seasonal or geographical pattern similar to those observed in real-world data. They performed an ecological correlation study using online search queries from Google Trends and age-adjusted estimates of mortality associated with heart disease, heart failure and stroke per 100,000 persons. They found that query volumes followed strong seasonal patterns and that state-level search query volumes showed moderate to strong positive correlations with mortality rates2. Bragazzi6 explored internet usage data for seeking health materials for self-care and self-management purposes in monitoring multiple sclerosis using Google Trends. The study concluded that Google Trends was a reliable tool for monitoring multiple sclerosis, with significant correlations found between clinical manifestations and treatment across different states in Italy6.

Motivations of the current study

As conventional epidemiological data collection and analysis is labor intensive and time consuming, Google Trends offers an alternative source of real-time data. Such alternatives, being part of PHDS, give public health advocates an opportunity to yield immediate evidence for crafting disease control and prevention strategies. The diversity of subjects for which Google Trends can examine changes in search interest over time, and the usefulness of this tool in assessing human behavior, make it evident that correlating online search traffic data with conventional epidemiological data will be valuable to explore, predict and forecast health behavioral changes amongst populations4. Given the high prevalence of stroke in Malaysia in recent years, it is timely to offer this novel epidemiological surveillance data analytics tool at the population level for faster evidence synthesis.

Methods

Study population and design

This countrywide ecological correlation and time series study was conducted using data from January 2004 to March 2019 and employed digital and spatial epidemiological analytics to study stroke HISB and the incidence of stroke among the Malaysian population. Digital epidemiology adopts the concepts of “infodemiology” and “infoveillance,” recently described as the “new public health,” to study online HISB of health-related conditions and disease patterns, distributions, trends, variations and correlations by using novel internet data streams31. “Infodemiology” has been defined as the science of the distribution and determinants of information in an electronic medium, specifically the internet (here, Google Trends), with the ultimate aim of informing public health policy, whereas “infoveillance” has been conceptualized as the longitudinal tracking of “infodemiology” metrics for surveillance and trend analysis. Spatial epidemiological analytics using geographic information systems (GIS) were employed to understand the distribution of HISB and stroke incidence across regions, cities and states in Malaysia.

Data source

Online HISB of stroke was retrieved from Google Trends multi-timeline search query data. Google Trends, an online tracking system of internet search volumes that merged with Google Insights for Search (Google Inc.)32, was searched for the period from 2004 to 31st March 2019 for the terms “stroke,” “strok (Malay),” “angin ahmar (Malay),” “cerebrovascular accident,” and “CVA” in Malaysia. Related domains of “stroke and organ affected,” “stroke types,” “stroke symptoms,” “stroke signs,” “stroke risk factors,” “stroke treatment” and “stroke prevention” were also explored. Google Trends automatically normalizes data for the overall number of searches and reports values as relative search volumes on a scale from 0 to 100, so that variations of different search terms can be compared across geographical settings and periods. A value of 0 does not necessarily indicate no searches, but rather a search volume too low to be included in the results. This approach has been applied and validated. All queries and search volumes related to stroke were downloaded in .csv file format.
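The 0 to 100 scale described above can be illustrated with a minimal sketch. Google Trends does not expose raw search counts, so the function name `relative_search_volume` and the monthly figures below are hypothetical, and the two-step scaling (share of all searches, then rescaling so the peak equals 100) is an assumption about the general idea rather than Google's exact procedure.

```python
def relative_search_volume(counts, totals):
    """Scale per-period search shares to a 0-100 index (peak share = 100).

    counts: searches for the term in each period (hypothetical raw numbers).
    totals: all searches made in the same region and period.
    """
    # Step 1: express the term as a share of all searches in each period.
    shares = [c / t for c, t in zip(counts, totals)]
    peak = max(shares)
    if peak == 0:
        return [0 for _ in shares]
    # Step 2: rescale so that the period with the highest share scores 100.
    return [round(100 * s / peak) for s in shares]


# Hypothetical monthly figures, for illustration only.
rsv = relative_search_volume([50, 120, 80], [10_000, 12_000, 16_000])
print(rsv)
```

Because the index is relative to the peak share, two regions with the same value need not have the same absolute number of searches.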

Conventional surveillance data of actual stroke counts in the country was obtained from the NNEUR, a prospective, multicenter hospital-based registry that captures data of acute stroke patients admitted across Ministry of Health Malaysia hospitals nationwide. The registry is an on-going effort funded by the government of Malaysia and consists of fifteen participating stroke hospitals across the Peninsular Malaysia and Borneo region. The registry aims to capture a comprehensive epidemiological surveillance data of stroke in the country. NNEUR participating stroke hospitals enroll confirmed hospitalized stroke patients within two weeks of symptoms onset26, 33. Actual stroke counts that were available between 2012 and March 2019 across states were retrieved and tabulated.

Procedure

The procedure of data retrieval, exploration and analysis was conducted based on the validated methodological framework proposed by Mavragani and Ochoa34. It includes four major steps as follows:

  I. Step 1: Measurement of online search interests (data overview). We explored online interest for different terms or keywords (up to five) in the same region for the same period, namely “stroke,” “strok,” “angin ahmar,” “CVA,” and “cerebrovascular accident” in Malaysia from January 01, 2004, to March 31, 2019. Related domains of stroke were also explored. Because a spelling that is incorrect in English can be correct in Malay (e.g. “strok” is a correct Malay spelling of “stroke” but would be treated as a misspelling in English), we used the “+” feature during searches to aggregate the result volumes of such variants rather than eliminating them.

  II. Step 2: Explore seasonality or variations. This step aimed to detect variations or seasonality of web-based interest. It establishes whether the data are suitable for examining relations between online search interests and actual events or disease cases.

  III. Step 3: Finding correlations. This step correlates web-based queries with one another or with official case data. The official stroke count data in Malaysia was obtained from the NNEUR.

  IV. Step 4: Predict and forecast. This final step aimed to predict and forecast stroke HISB together with the future incidence of stroke.

Statistical methods

Statistical analysis was conducted using R version 3.5.135 and IBM SPSS Statistics version 22.036. We conducted time series analyses to explore trends of HISB of stroke in Malaysia. Seasonality over time, variations by month and weather, top search queries and flux volumes were determined from Google Trends multi-timeline data. To test for differences in mean search volumes across weather seasons and months, we used linear regression analysis with season or month as a categorical predictor, and the 95% CIs for percentage change were bootstrapped with 1,000 random samples.
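As a rough illustration of the comparison described above: in a linear regression with a single dummy-coded categorical predictor, the coefficient for a category equals the difference between that category's mean and the reference category's mean, so the percentage change can be sketched directly from group means. The function name `pct_change_ci`, the percentile bootstrap and the sample values are assumptions made for illustration; the resampling scheme actually used in R or SPSS may differ.

```python
import random
from statistics import mean


def pct_change_ci(group, reference, n_boot=1000, seed=0):
    """Percentage difference in mean search volume of `group` versus
    `reference`, with a 95% percentile bootstrap interval."""
    rng = random.Random(seed)
    point = 100 * (mean(group) - mean(reference)) / mean(reference)
    draws = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        g = rng.choices(group, k=len(group))
        r = rng.choices(reference, k=len(reference))
        draws.append(100 * (mean(g) - mean(r)) / mean(r))
    draws.sort()
    lower = draws[int(0.025 * n_boot)]
    upper = draws[int(0.975 * n_boot) - 1]
    return point, lower, upper


# Hypothetical relative search volumes for one month and the reference month.
january = [44, 46, 48, 50, 42]
may = [30, 32, 28, 34, 26]
point, lower, upper = pct_change_ci(january, may)
print(f"{point:.1f}% (95% CI {lower:.1f} to {upper:.1f})")
```

An interval that excludes zero corresponds to a difference from the reference month that is unlikely to be due to sampling variation alone.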

Correlograms were generated using Wessa Time Series37 to check the significance of autocorrelation and adjusted partial autocorrelation in the time series. In addition, we assessed the randomness of the data from the series of time lags at which points reached zero or near-zero in the correlograms. Whether the points decay to near zero rapidly or slowly indicates whether the data are stationary or non-stationary, respectively.
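To make the correlogram concrete, the sketch below computes the sample autocorrelation at each lag and the approximate 95% band expected under white noise, 1.96/√n. It is a plain illustration of the standard estimator, not the wessa.net implementation, and the function names and example series are hypothetical.

```python
from math import sqrt


def acf(series, max_lag):
    """Sample autocorrelation of `series` at lags 1..max_lag."""
    n = len(series)
    m = sum(series) / n
    denom = sum((x - m) ** 2 for x in series)
    result = []
    for k in range(1, max_lag + 1):
        num = sum((series[t] - m) * (series[t + k] - m) for t in range(n - k))
        result.append(num / denom)
    return result


def white_noise_bound(n):
    """Approximate 95% limit for the autocorrelation of white noise."""
    return 1.96 / sqrt(n)


# Hypothetical trending series: its autocorrelations decay slowly,
# the pattern the text associates with non-stationary data.
trend = list(range(1, 21))
for lag, r in enumerate(acf(trend, 5), start=1):
    flag = "significant" if abs(r) > white_noise_bound(len(trend)) else "within band"
    print(lag, round(r, 3), flag)
```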

Choropleth maps for the spatial epidemiological analysis were produced by merging the data with the Global Administrative Database (GADM, level 1 data: Malaysia), available from the Center of Spatial Sciences38. The flux volumes of a list of stroke attributes and related terms were correlated with the hit search data using Pearson’s correlation coefficient analysis. Pearson’s correlation is a measure of linear correlation between two continuous variables39,40; in this study, the “stroke” search term was treated as the dependent variable and stroke-related terms retrieved from Google Trends search queries as independent variables. The analysis yields Pearson’s correlation coefficient (r)39,40, which ranges between −1 and 1. A correlation of −1 indicates that the two variables are perfectly negatively linearly related, a correlation of 0 means that the two variables have no linear relation, while a correlation of 1 means that the two variables are perfectly positively linearly related40,41. Consistent with these statistical theories, we followed recent time series studies that used Google Trends to explore correlations within search terms or between search terms and count data of different diseases by employing Pearson’s correlation analyses2,7,8,10.
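For reference, Pearson’s r is the covariance of the two variables divided by the product of their standard deviations. The short sketch below implements that definition; the function name `pearson_r` and the example values are illustrative only and are not taken from the study data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical monthly relative search volumes for two related terms.
stroke = [40, 55, 60, 52, 70, 65]
stroke_symptoms = [20, 30, 33, 27, 41, 36]
print(round(pearson_r(stroke, stroke_symptoms), 3))
```

Because r is symmetric in its two arguments, swapping the “dependent” and “independent” series gives the same coefficient.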

Subsequently, we performed an ecological correlation analysis42,43 to test whether search volumes were correlated with the actual incidence of stroke at state and country level using Pearson’s correlation coefficient analysis. All tests were two-tailed, with statistical significance set at P < 0.05.

Finally, we fitted a forecasting model using exponential smoothing with the Winters additive method to yield Malaysia’s Stroke 2.0, which aims to forecast HISB and the projected incidence of stroke over the next 3 years. Forecasting and modelling methods in principle follow two general approaches: exponential smoothing or moving averages44. The choice between the two approaches depends on the stationarity and seasonality of the time series data44,45,46. Moving averages are well suited to stationary time series44. As our time series data showed seasonality and was non-stationary, we opted for exponential smoothing44. The literature has identified Holt-Winters exponential smoothing (a stochastic procedure applied to observations over time) as better and more widely used owing to its flexibility in handling seasonal variations45,47. The method assigns exponentially decreasing weights to observations as they recede into the past, so that recent observations carry the greatest weight and older observations relatively less47. The Winters method offers two ways to carry out the forecasting analysis: the additive method or the multiplicative method45,46,48,49. The additive method is used when the seasonal variation is roughly constant, while the multiplicative method is used when the seasonal variation changes in proportion to the level of the time series44,45,46. As our data is more inclined to the former, we used the additive method. The mathematical formula is given below:

$$Level:\;S_{t} = \alpha \left( {X_{t} - I_{t - s} } \right) + \left( {1 - \alpha } \right)\left( {S_{t - 1} + T_{t - 1} } \right)$$
(1)
$$Trend:\;T_{t} = \gamma \left( {S_{t} - S_{t - 1} } \right) + \left( {1 - \gamma } \right)T_{t - 1}$$
(2)
$$Seasonality:\;I_{t} = \delta \left( {X_{t} - S_{t} } \right) + \left( {1 - \delta } \right)I_{t - s}$$
(3)
$$Forecasting:\;\hat{X}_{t} \left( k \right) = S_{t} + kT_{t} + I_{t - s + k}$$
(4)

in which α, γ and δ denote smoothing parameters, s is the number of periods in one seasonal cycle, and \(S_{t}\), \(T_{t}\) and \(I_{t}\) are the smoothed level, trend and seasonal components. The observed values (\(X_{t}\)) are projected through the forecasting Eq. (4), k steps ahead, to yield the prediction \(\hat{X}_{t} \left( k \right)\)46,50.
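The recursion can be sketched in a few lines for the additive variant of the Winters method, in which the seasonal component is subtracted from the observation when updating the level and added back when forecasting. The function name `holt_winters_additive`, the way the initial level, trend and seasonal values are chosen, and the example series are all assumptions for illustration; SPSS Time Series Modeler estimates the smoothing parameters and initial states internally and may do so differently.

```python
def holt_winters_additive(x, s, alpha, gamma, delta, k):
    """Forecast k steps ahead with additive Holt-Winters smoothing.

    x: observed series covering at least two full seasons.
    s: number of periods in one seasonal cycle.
    alpha, gamma, delta: smoothing parameters for level, trend, seasonality.
    """
    if len(x) < 2 * s:
        raise ValueError("need at least two full seasons of data")
    # Initial values from the first two seasons (a simple heuristic,
    # assumed here; other initialisation schemes exist).
    first = sum(x[:s]) / s
    second = sum(x[s:2 * s]) / s
    level = first
    trend = (second - first) / s
    season = [x[i] - first for i in range(s)]
    for t in range(s, len(x)):
        prev_level = level
        i = t % s  # season[i] holds the seasonal value from one cycle ago
        level = alpha * (x[t] - season[i]) + (1 - alpha) * (prev_level + trend)
        trend = gamma * (level - prev_level) + (1 - gamma) * trend
        season[i] = delta * (x[t] - level) + (1 - delta) * season[i]
    n = len(x)
    # h-step-ahead forecast: level + h * trend + matching seasonal component.
    return [level + h * trend + season[(n + h - 1) % s] for h in range(1, k + 1)]


# Hypothetical quarterly series with a stable seasonal pattern and no trend.
history = [10, 20, 30, 40] * 3
print(holt_winters_additive(history, s=4, alpha=0.3, gamma=0.1, delta=0.2, k=4))
```

For monthly Google Trends data, s would be 12 and a 3-year horizon would correspond to k = 36.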

Ethics statement

This study was approved and registered with the National Medical Research Registry of Malaysia (registration number: NMRR-19-1067-48224-IIR).

Conference presentation

Findings from this study were presented at the 6th Asia-Pacific Conference on Public Health, 22nd–25th July, 2019 at the Equatorial Hotel, Penang, Malaysia.

Results

Trends of stroke health information-seeking behaviors

The most common search query was the English term ‘stroke.’ Between January 2004 and 31st March 2019 (n = 183), a total of 6,282 ‘stroke’ hit search queries were generated through Google Trends in Malaysia. The interest over time of internet search queries showed a cyclical pattern within a 2-year interval, and subsequently exhibited seasonality over the years (Fig. 1). Correlograms that yielded autocorrelation and partial autocorrelation plots showed statistical significance with series of time lags, and dataset was at randomness (Fig. 2).

Figure 1

Google Trends of ‘stroke’ hit searches over the years. Data was mined since inception from 2004 till 31st March 2019. The top figure panel exhibits query patterns of all terms with similar meaning used in Malaysia: ‘stroke’ in English; ‘strok’ and ‘angin ahmar’ in Malay; ‘cerebrovascular accident’ and ‘CVA’ as medical terms. The bottom figure panel exhibits pattern of the most common search query, ‘stroke’ in English. Figure panels were created in R version 3.5.135 (www.R-project.org).

Figure 2

Autocorrelation and partial autocorrelation plots for ‘stroke’ search queries. Data was mined since inception from 2004 till 31st March 2019. Statistical significance exists between series of time lags (P < 0.05). Correlograms were plotted using wessa.net time series function37. Yielded parameters: lambda = 1, d = 0, and D = 0 indicated no transformation or differencing was applied before PACF was computed. 95% confidence interval (CI) was computed assuming white noise time series. ACF autocorrelation function; PACF partial autocorrelation function.

Variations of search volumes by months and weather

The mean percentage of stroke search volume was significantly higher for the period of January to April and June to December in comparison to the month of May (P < 0.01 for January–February, April, June–October and December compared to May; P = 0.016 for March vs May; P = 0.014 for November vs May) (Table 1). When analyzed by weather, average search volume was higher during the Northeast Monsoon in comparison to the Southwest Monsoon (P < 0.001) (Table 1).

Table 1 Mean percentage of stroke search volumes compared with reference month and weather.

Geographic variations of stroke health information-seeking behaviors in Malaysia

Figure 3 illustrates a choropleth map of the geo-spatial distribution of ‘stroke’ HISB across all states in Malaysia. The map showed a geographical gradient within Peninsular Malaysia, with higher hit-search flux volumes originating from the East Coast Region (Kelantan and Terengganu), Northern Region (Perlis) and the Southern Region (Negeri Sembilan). The states from the Central Region (Selangor and the Federal Territories) yielded relatively mild to moderate flux volumes. However, flux volumes from East Malaysia (Borneo states) were relatively moderate to high. The top five Malaysian states by search flux volume of ‘stroke’ were Kelantan (100), Perlis (83), Terengganu (81), Negeri Sembilan (76) and Pahang (76). The top five Malaysian cities or towns by search flux volume were Kota Bharu (100), Batu Pahat (82), Ampang Jaya (78), Kuala Terengganu (77) and Sungai Petani (76). The ‘stroke’ search flux volumes are normalized values rather than crude absolute counts.

Figure 3

Choropleth map showing distribution of “stroke” search queries in Malaysia. Data was mined since inception from 2004 till 31st March 2019. Choropleth map was generated by merging Google Trends ‘stroke’ hit search queries multi-timeline data with the Global Administrative Dataset (GADM—level 1 data: Malaysia)38; available from the Center of Spatial Sciences at the following link: https://gadm.org/download_country_v3.html. Choropleth map was created in R version 3.5.135 (www.R-project.org).

Distribution of stroke in Malaysia

Between 2012 and March 2019, there were 14,396 stroke cases recorded across eleven states in Malaysia. By month, the recorded cases were 1,351 in January, 1,111 in February, 1,296 in March, 1,180 in April, 1,305 in May, 1,295 in June, 1,054 in July, 1,179 in August, 1,045 in September, 1,183 in October, 1,311 in November and 1,086 in December. Figure 4 shows a choropleth map of the geo-spatial distribution of stroke cases in Malaysia. The map showed a consistent geographical gradient between stroke cases and hit searches across regions within Peninsular Malaysia. Stroke cases were higher in the East Coast Region (Kelantan and Terengganu), Northern Region (Pulau Pinang, Kedah and Perlis) and the Southern Region (Negeri Sembilan). However, at the state level the ranking of stroke cases differed from that of hit search volumes: Terengganu recorded a “red alert” with the highest stroke count in Malaysia (6,744 cases), followed by Sarawak (2,340 cases), Pulau Pinang (1,754 cases), Kelantan (1,620 cases), Kedah (623 cases), Perlis (554 cases) and Selangor (510 cases).

Figure 4

Choropleth map showing distribution of stroke in each state in Malaysia. Data was mined since inception from 2012 till 31st March 2019. Official count data was retrieved with permissions from the NNEUR of Malaysia – an official registry that captures stroke data within the Ministry of Health Malaysia facilities countrywide. Malaysia’s stroke count data included eleven states (excluded Federal Territories, Negeri Sembilan and Melaka due to unavailability of data for inclusion into analysis). Choropleth map was generated by merging actual counts data from the official NNEUR data with the Global Administrative Dataset (GADM – level 1 data: Malaysia)38; available from the Center of Spatial Sciences at the following link: https://gadm.org/download_country_v3.html. Choropleth map was created in R version 3.5.135 (www.R-project.org).

Correlations of stroke-related Google Trends search queries

Table 2 exhibits correlations between stroke related Google Trends search queries. Stroke symptoms and signs and risk factors were the most searched stroke-related terms in the population. Most stroke-related search queries showed positive correlations with statistical significance (P < 0.05). Across all search queries, “stroke and weakness” showed the strongest positive relationship (r = 0.851, P = 0.014) followed by the risk factor “stroke and family” (r = 0.401, P < 0.001).

Table 2 Correlations of stroke-related Google Trends search queries.

Correlations of stroke Google Trends search query and stroke counts

Most states in Malaysia showed statistical significance between ‘stroke’ Google Trends search query with actual counts of stroke. From the countrywide perspective, Malaysia showed a statistically significant negative correlation between ‘stroke’ search query and actual counts data. With the exception of Pulau Pinang and Sarawak that showed a statistically significant positive correlation between ‘stroke’ search query and actual counts data, the remaining states of Perlis, Terengganu, Selangor, Kedah and Sabah showed statistical significance with negative correlations (Table 3).

Table 3 Correlations between stroke-related search query and actual stroke counts data.

Forecasting model of stroke in Malaysia

Figure 5 shows an estimated forecasting model of stroke in Malaysia. The initial correlograms showed that the decay of points across the series of time lags to near-zero was slow, suggesting that the data were non-stationary. We subsequently assessed stationarity using unit-root tests. The Augmented Dickey-Fuller test was not statistically significant (P = 0.722), so the null hypothesis of a unit root could not be rejected, while the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test was statistically significant (P = 0.001), rejecting the null hypothesis of stationarity. Both results indicated non-stationarity, and we therefore modelled the data with exponential smoothing. The forecasting model fitted using the Winters additive method was statistically significant (P = 0.001) and explained 62.7% of the total variance. The multi-fitted data within the 95% confidence interval showed that the ‘stroke’ Google Trends search query would continue to rise but the incidence of stroke may decrease slightly or reach a plateau within the next 3 years (Fig. 5).

Figure 5

Stroke forecasted model for Malaysia. Forecasted Time Series Modeler was yielded in IBM SPSS Statistics version 22.036.

Discussion

This countrywide ecological correlation and time series study utilized the combination of ‘digital epidemiology’ through novel data stream (Google Trends internet data) and ‘classical epidemiology’ of surveillance count data through disease registry that was explicitly aimed to nurture a comprehensive population health-forecasting model of stroke in Malaysia. With rising stroke incidence, we set to address the Malaysian populations’ HISB of stroke in real-time situations, how these behaviors were changing over time with weather variations and geographic gradients, and how would Malaysians be impacted by the current stroke scenario in the future. The trends and patterns yielded in this preliminary spatial epidemiological and time series analytical approach from the Malaysian perspective would set the direction of public health policy preventive measures and tertiary level management guidelines for stroke in the country.

We observed one significant peak of hit searches in 2016. The relatively high search volumes of ‘stroke’ in 2016 could be attributed to the initiation of large-scale campaigns and interventions at the hospital and community level nationwide. In 2015, stroke emerged as the second highest non-communicable disease afflicting Malaysians. Malaysia’s leading efforts in combating stroke were recognized by the World Stroke Organization in 2016, when the country’s sole rehabilitation hospital was named the best institutional campaigner to prevent stroke in the low and middle income country category51. From the public health perspective, advocates called for the immediate unification of stakeholders from government, private and non-governmental organizations to integrate the nationwide hypertension campaign, “The Morning Hype Campaign,” with the “My Stroke Story Photo Exhibition Campaign,” the largest ever representation, which involved thirty-one stroke survivors who were empowered to submit photo stories depicting their personal journeys of stroke survival with the desire to live life to the fullest52. An event that drew media attention in 2016 was the news of a Malaysian who suffered a stroke in London and whose family was hit with a very high hospital bill, halting further treatment for stroke. The story stirred public emotion in Malaysia, and an online fundraising campaign was launched to allow donors to channel donations and to follow the health progress of the stroke survivor53. These events may have triggered the spike in hit searches for stroke on Google across Malaysia in 2016.

Over the 15-year study period, we observed that the population’s HISB of stroke showed a cyclical pattern within a 2-year interval, and subsequently extended to a seasonal trend in recent years (as evident from Fig. 1). As borderless internet connectivity allows accessibility across all regions in Malaysia with the emergence of the Internet of Things (IoT), the cyclical pattern yielded through the trend series analysis could be attributed to immediate HISB by stroke-afflicted patients and their relatives, family members, colleagues or friends seeking further information about stroke. Google has acknowledged the significance of online health searches and has prioritized the delivery of medically accurate and reliable information30. People searching for information on stroke and its outcomes may do so at the time they are experiencing symptoms and may believe that the information provided by Google is accurate for deciding the next course of action. Two possible explanations could be derived from the temporal patterns exhibited in our trend analyses. The first is that people may search for symptoms at the time they are experiencing discomfort such as limb weakness or slurred speech during the onset of stroke or transient ischemic attack. Such searches could be made by the patients themselves at the early onset of symptoms or by their representatives when their clinical condition deteriorates further. The second is that seasonal patterns extending over months or years could be attributed to searches by post-stroke survivors exploring disease prognosis, quality of life, disabilities, treatment strategies and cure. Searches during this period could also be conducted by patients’ family members, relatives, friends or colleagues to provide social and functional support in view of the debilitating nature of stroke, which impairs activities of daily living (ADL) in post-stroke survivors. These situations may have driven the periodic rises and falls of ‘stroke’ hit searches seen in Google Trends. Similar patterns were observed in studies of passively generated search queries from Google Trends that evaluated seasonal patterns in HISB for a variety of non-communicable diseases2,54,55,56.

HISB varied by month and weather. We observed greater peaks of hit search volumes between November and April annually, which paralleled the Northeast Monsoon, suggesting that a causal link may exist in which stroke-related information-seeking behaviors are mediated by the higher incidence of stroke during the Northeast Monsoon (6,155 cases) as compared to the Southwest Monsoon (5,878 cases). Interestingly, the links between HISB and incidence of stroke during the Northeast Monsoon were consistent with the geo-spatial distribution of the choropleth maps. Regions affected during this weather season were the East Coast Region and the Northern Region of Peninsular Malaysia. “Red alerts” were conveyed through the distribution maps, which showed that the states in these two regions, namely Kelantan, Terengganu and Perlis, had high stroke incidence and high volumes of stroke search queries. A previous state-specific study showed that Terengganu had a relatively high number of stroke cases33.

The current study was the first from the Asian perspective to offer three anticipated relationships in a spatial epidemiological analysis, showing consistencies between HISB and actual stroke count data with month, weather and geographical variations in the country. Although these findings were consistent with previous studies that explored HISB for a variety of non-communicable diseases from an ecological perspective33, 54,55,56, those studies were limited to only two associations: the relationship between online HISB and either the incidence of the disease or seasonal variations. They could not link these attributes coherently with the pattern of seasonal variations and geographical distribution. A substantial body of literature has found considerable evidence that meteorological, temperature or weather variations pose a greater risk for the occurrence of stroke17,18,19, 57,58,59. More specifically, seasonal variations of stroke were more likely to be observed during colder months60,61,62,63,64,65. These trends were consistent with the findings of our current study that stroke incidence, coupled with high HISB, was more prevalent during the colder Northeast Monsoon season. A plausible explanation for this association is that seasonal changes from warmer to cooler temperatures cause increased blood viscosity or vasoconstriction, a major predictor of stroke59,64.

Brigo and colleagues postulated that people with chronic health conditions frequently use search engines to look up terms related to the definition, etiology, risk factors, symptoms, treatment and prevention of their disease66. Our findings are in line with this hypothesis, as stroke-related Google Trends search queries showed positive correlations with disease pathology, risk factors, symptoms, signs, treatment and prevention. Similar patterns were observed in online search queries for other diseases or health conditions, namely status epilepticus7, multiple sclerosis6 and systemic lupus erythematosus8. We also found correlations between HISB and actual stroke incidence across states and in the countrywide estimate. Although statistically significant, most state-level and countrywide associations between HISB and actual stroke incidence were negative. A plausible explanation lies in the nature of the disease or health-related state being studied. Correlations involving non-communicable diseases are highly complex to decipher because a number of environmental and lifestyle factors that directly affect the disease state need to be controlled for, such as geography, ethnicity, physical activity, eating habits and social interactions. Although knowledge of stroke may improve as online search queries rise, lifestyle behaviors could mediate a bidirectional effect between socio-economic status and health. In states with lower socio-economic status, the geographical setting may lead to weaker motivation and inadequate resources to maintain a healthy lifestyle. This theoretical model was advocated by Wang & Geng67. We also took note of region-specific estimates, in which several states are grouped together. HISB seemed to correlate well with actual stroke cases across regions, but the correlation between HISB and state-specific counts showed some inconsistencies, as discussed earlier. A similar finding was observed in a previous study from the USA29. Plausible explanations include: (1) the state-specific data captured in the registry dataset used for comparison were themselves limited; (2) at the regional level, states within a region are pooled together, yet states with higher socio-economic status or urban areas have better internet penetration, giving rise to more search queries and yielding positive relationships with actual stroke cases; and (3) geographic differences (at either state or region level) in actual stroke risk factors such as ethnicity, diet, obesity, diabetes mellitus or socio-economic status may serve as surrogate markers for greater internet search interest among the population at risk29.
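
Explanation (2) can be illustrated with a small numerical sketch. The figures below are synthetic and are not taken from the registry or Google Trends data used in this study. The region labels, the `pearson` helper and all values are assumptions chosen only to show that a correlation computed across states within a region can be negative while the same data, summed by region, correlate positively.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Synthetic (search index, stroke count) pairs per state, grouped by region.
# These numbers are illustrative only and do not come from the study.
states_by_region = {
    "Region A": [(10, 30), (20, 20), (30, 10)],
    "Region B": [(40, 60), (50, 50), (60, 40)],
    "Region C": [(70, 90), (80, 80), (90, 70)],
}

# State-level correlation computed separately within each region.
within_region = {
    region: pearson([s for s, _ in pairs], [c for _, c in pairs])
    for region, pairs in states_by_region.items()
}

# Region-level correlation after summing the states in each region.
region_search = [sum(s for s, _ in pairs) for pairs in states_by_region.values()]
region_cases = [sum(c for _, c in pairs) for pairs in states_by_region.values()]
across_regions = pearson(region_search, region_cases)

print(within_region)   # negative within every region
print(across_regions)  # positive once states are pooled by region
```

The sign reversal arises purely from the level of aggregation, which is one reason state-level and region-level estimates need not agree.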

For the first time, we combined spatial epidemiology with time series analytics, using both novel internet data streams and conventional surveillance data on non-communicable diseases. We developed a combined impact model that forecasted Malaysia’s Stroke 2.0 of HISB and stroke incidence for the next 3 years. The model indicated that, as HISB of stroke continues to rise, the incidence of stroke may decrease slightly or reach a plateau over the next 3 years. Following the spurious peak of stroke searches in 2016, and with rigorous stroke campaigns ongoing, we believe that people have consistently explored more about stroke online, thus gaining appropriate up-to-date knowledge on the treatment, control and early prevention of stroke. This could explain why actual stroke cases appeared stationary over the subsequent years and are projected to plateau or decline over the next 3 years in our forecasted model. It gives the important impression that, as people seek more information about stroke on the internet, they improve their knowledge and understanding of stroke, which in turn triggers self-care efforts and control measures to prevent stroke. We recommend that this promising observation be examined urgently, using robust analytics and study designs, to test the variables that may influence it. We believe that internet resources have enhanced stroke knowledge and that, coupled with the efforts of stroke advocates who are currently drafting policies to shift stroke care from vertical to horizontal prevention strategies through campaigns, community screenings and surveillance, this may account for the pattern seen in the forecasted model.

Public health implications

Internet data analytics operates in real time, unlike conventional surveillance or registry data. This addresses the delays in data collection, analysis, forecasting and interpretation of evidence needed to inform urgent public health policy. Our analysis identified geographic variations in stroke HISB and actual stroke counts across different states in Malaysia. This approach provides a metric for evaluating health disparities among populations at the national level, allowing public health practitioners and advocates in the country to direct community health programs and interventions in a targeted way, such as accelerating stroke risk-factor prevention programs and education measures in disproportionately affected states. Temporal trends in query volumes, coupled with their geographic distribution, could yield a quantifiable and valuable measure of public attention to stroke and of the information needs surrounding it. Integrating internet data analytics with conventional registry data, as in the current results, would create great opportunities for public health agencies to disseminate health information rapidly, efficiently and cost-effectively, provided that reliable information is shared with the population. It would be timely to accelerate public health informatics applications, in which the rapid uptake of new technology within the population, as captured by Google Trends, could serve as a proxy for designing diffusion strategies for health education messages, thus filling the translational gap between best evidence and practice. Stakeholders in the public health domain could leverage these new technologies and the abundance of information to plan proper communication strategies for the prevention of stroke.

Study strengths and limitations

The current study used time series analytics on novel internet data streams (conceptualized as digital epidemiology through the application of infodemiology and infoveillance methodologies), which offsets several disadvantages of conventional epidemiological approaches. Digital epidemiology provides real-time information on the population’s HISB at the national level. Paired with spatial epidemiological approaches, it allows disease states and risk factors to be detected in high-risk areas or regions for quick intervention. The approach is cost-effective and quick to carry out, so that public health advocates can be notified for rapid policy drafting and implementation.

Internet data have certain limitations that warrant caution during interpretation. The first is the ambiguity of search keywords, as Google Trends monitors only queries carried out in the Google search engine. The search terms may not be a proxy for individuals with stroke or at high risk of stroke, as academics or professionals who are merely interested or curious may also contribute search hits. The anonymity of Google Trends data limits the exploration of stroke HISB across specific demographics and subpopulations, and of disparities among populations. This is important because the incidence of stroke is stratified by age, ethnicity, gender and socio-economic characteristics26. Understanding local HISB of stroke is crucial, but Google Trends data are not available for geographical areas smaller than the state or city/town level, depending on the search volumes yielded. Google Trends eliminates repeated queries from the same user over a short period of time to reduce counts of continued searching, and applies a traffic-volume threshold so that very new search terms are assigned a value of zero, although this could change rapidly. As such, the data cannot be independently verified and may not be reliable, and investigators have limited control over the data, making quality control difficult.
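
The relative nature of Google Trends values and the effect of a volume threshold can be sketched as follows. This is a minimal illustration, not Google’s actual pipeline, which is not public in detail: the function name `relative_search_volume`, the `min_volume` cut-off of 50 and the example counts are all assumptions. It only assumes the publicly documented behavior that query counts are expressed as a share of all searches and rescaled so that the peak equals 100, with low-volume periods reported as zero.

```python
def relative_search_volume(query_counts, total_counts, min_volume=50):
    """Rescale raw query counts to a 0-100 index relative to the peak share.

    query_counts: searches for the term in each period (hypothetical data)
    total_counts: all searches in the same period and geography
    min_volume:   hypothetical threshold below which a period is reported as 0
    """
    shares = []
    for queries, total in zip(query_counts, total_counts):
        if queries < min_volume or total == 0:
            shares.append(0.0)  # low-volume periods are suppressed
        else:
            shares.append(queries / total)  # share of all searches

    peak = max(shares, default=0.0)
    if peak == 0:
        return [0 for _ in shares]
    # The period with the largest share is assigned 100.
    return [round(100 * share / peak) for share in shares]


# Illustrative counts only; these are not study data.
index = relative_search_volume(
    query_counts=[20, 100, 300, 150],
    total_counts=[10_000, 10_000, 20_000, 10_000],
)
print(index)
```

In this sketch the third and fourth periods receive the same index even though their raw counts differ, because the index reflects the share of all searches and not the absolute number of queries. This is part of why the values cannot be read as counts of individuals searching for stroke.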

With the revolution in big public health data, Google Trends remains the most popular tool to date for analyzing HISB using web-based data4. Online search traffic data have been recommended as a good indicator of internet behavior, and Google Trends has proved a reliable tool for predicting changes in human behavior, subject to careful selection of search terms4. With valid search terms, Google data can accurately measure a population’s interest and behavior68. As we explored and forecasted a particular disease attribute (in this case “stroke”), the search terms and queries remain constant over time. Because valid and consistent terms were used to explore disease attributes (e.g. symptoms and signs, risk factors, treatment, etc.), the search terms and analysis are replicable in future research, which supports reliability. Moreover, our search term exploration technique was based on the validated model proposed by Mavragani & Ochoa13.

Owing to the ecological-correlation study design, our results may be subject to ecological fallacy, as conclusions about individual-level stroke epidemiological associations cannot be drawn reliably from group-level data. However, it is an appropriate study design for exploring trends and patterns and for observing correlations of exposures at the population level for a particular disease or public health phenomenon. The current study may also be subject to “mixing”, as migration of the population between states may dilute the differences between groups in our study population. To be consistent with epidemiological concepts for determining disease distribution and determinants, future research using Google Trends data should incorporate individual tracing when users are logged in to their accounts, thus enabling the retrieval and analysis of user characteristics such as age, gender and ethnicity. This would allow more meaningful interpretations based on risk stratification for stroke. The usefulness of Google Trends data should be maximized to facilitate public health interventions, health education and health promotion, but the data should be used with caution and with relevant privacy settings assured.

Conclusion

The current study has provided insights into trends in stroke HISB from internet data, which showed possible associations with weather and geographical variations through time series analytics and spatial epidemiology approaches. Search queries were correlated positively with disease characteristics but negatively with actual stroke counts. Our forecasted model showed that HISB will continue to rise but stroke incidence may reach a plateau within the next 3 years. The current study offers a new real-time surveillance tool and approach to alert public health systems and policy makers when planning appropriate resources for stroke detection and prevention in the country. Future studies should validate internet-based data against external datasets to ensure the reliable use of such approaches.