Comparing Social media and Google to detect and predict severe epidemics

Internet technologies have demonstrated their value for the early detection and prediction of epidemics. In diverse cases, electronic surveillance systems can be created by obtaining and analyzing on-line data, complementing other existing monitoring resources. This paper reports the feasibility of building such a system with search engine and social network data. Concretely, this study aims at gathering evidence on which kind of data source leads to better results. Data have been acquired from the Internet by means of a system which gathered real-time data for 23 weeks. Data on influenza in Greece have been collected from Google and Twitter and they have been compared to influenza data from the official authority of Europe. The data were analyzed by using two models: the ARIMA model computed estimations based on weekly sums and a customized approximate model which uses daily sums. Results indicate that influenza was successfully monitored during the test period. Google data show a high Pearson correlation and a relatively low Mean Absolute Percentage Error (R = 0.933, MAPE = 21.358). Twitter results are slightly better (R = 0.943, MAPE = 18.742). The alternative model is slightly worse than the ARIMA(X) (R = 0.863, MAPE = 22.614), but with a higher mean deviation (abs. mean dev: 5.99% vs 4.74%).

The early detection and prediction of the spread of epidemics is an important concern in public health. Traditional systems and techniques mainly use epidemiological data, such as medical data or health log files obtained from doctors and hospitals. However, several studies have recently included non-epidemiological data obtained through the Internet as an alternative, concretely data extracted from search engines or messages interchanged in social media. Results show that this approach may provide a complementary source to detect and predict the volume and spread of unfolding for severe epidemics. These models also include event-based surveillance (EBS) systems and risk modelling 1 and use available Internet data to detect the evidence of an emerging threat, such as the onset of an epidemic. Some of these tools 2 combine risk management, signal processing and econometrics to construct a novel method towards a forecast of disease outbreaks in Europe. Also, in some cases, infection dynamics are discussed 3-5 based on complex networks of exogenous transmitting or environmental factors.
Some researchers further extend this approach to benefit from Internet of Things (IoT) technologies 6 . In addition, recent advances cover the analysis of real-time data or even the sentiment of the data found inside social media 7 , such as Twitter. The latter approach considers social media data as a word cloud, from which useful metrics and results can be measured and studied.
Previous works used the Google search engine or social media to track or discuss epidemics, and these two are the main sources for detecting and predicting the spread, the size and the peak of serious diseases. It has been found that nearly two thirds (65.70%) of Internet data fulfill the previous goal originating from the Google Search engine and Twitter 8 . Some previous research use both data sources for health purposes, such as the recent work of Jung et al. 9 . Nevertheless, that work does not answer the question about which one is best as a data source: Google or Twitter.
Computer technology offers nowadays many useful tools for mining the Web. In recent years, advanced statistical models have been used to analyze and interpret these data. Auto Regressive Integrated Moving Average model with exogenous (external) variable (ARIMA(X)) is widely used for epidemics, as it has been shown that it can successfully handle Internet data, and also for measurement and forecast [10][11][12] . Data extraction is possible through writing and executing specific programming code that uses API's (Application Programming Interfaces) through various development frameworks, such as Pytrends 13 , Tweepy 14 or Twython 15 . Furthermore, internet data may sometimes show some form of inconsistency, e.g. major rise or descent over time, exaggeration or underestimation of real events (such as epidemics) or repeating presence of the same content, which may significantly affect the detection of epidemics through the Internet.
In this research, we use data from Google Trends 16 and Twitter to detect influenza in Greece. Data from Google have been obtained through the web application of Google Trends and Twitter data through the Twitter STREAMING API 17 and the Tweepy library. Two models have been applied for the analysis of data; an ARIMA(X) model based on weekly data and a custom approximate model based on daily data. By creating a real-time information system, we then compared not only Google and Twitter, but also the two sources to each other. Our final objective was to determine the effectiveness of this information system in a low-cost platform, i.e. an application server with inexpensive components.
Based on the above, the specific objectives of this work are: • To compare and determine which data source is better in tracking epidemics; Google or Twitter, as the current literature provides limited information on this comparison. • To apply an ARIMA(X) model and compare an approximate model to minimize variances of the prediction values due to any possible inconsistency of the data from Twitter mentioned above. • To demonstrate that an internet surveillance system may provide with the necessary tools to detect and predict epidemics in real-time with an inexpensive hardware and software configuration.
The rest of this paper is structured as follows: Section 2 describes the methods and tools used for this study, then Section 3 contains the results. Discussion and conclusions are included in the last section. The Appendices contain the description of the used platform and the ARIMA(X) approximation algorithm.

Methods and tools
Research questions. The research plan was designed and executed in three different phases: preparation, data gathering and data analysis. The research was conducted during a nine-month period, from August 1, 2018 until the 30 th of June of 2019. The software and hardware preparations lasted about 4 months and a half, the data capturing period was 23 weeks and the analysis phase was about one month.
The plan was designed to answer the following research questions (RQs): RQ1:Which source can provide more accurate estimation and prediction of the influenza development: Google or Twitter? RQ2: In which cases can we build and use an approximate model to the ARIMA(X) statistical model with accurate predictions? What are the merits of such as model?
The alternative model mentioned in RQ2 accounts for particularities of the data, concretely: to minimize the effect of repeated messages from Twitter (retweets), or the occurrence of messages with the same or similar content, and to properly adjust and smoothen the prediction curve in cases of a sudden increasing or descending infection impact or when the influenza activity is very low.

Data.
We obtained three data sets for a period of 23 weeks, from December 13, 2018 until May 20, 2019.
All sets contain data on influenza in Greece: one of them is from the European Center of Disease Control and Prevention (ECDC). ECDC has established a joint Centre 18 for Disease Prevention and Control with the World Health Organization (WHO) Regional office. Flu News Europe bulletin includes monitoring data and guidelines on influenza activity in the 50 Member-States of the European Region 19 . The second set came from Google Trends and the third from Twitter. For Google and Twitter, we used as search tokens two terms in the Greek language: Γρίπη and γρίπη. These are equivalent to the English word Influenza. The first keyword is mainly used in the beginning of the sentence, while the second one can be used in the middle. We used both as we tried to have, as much as a big data sample, especially from Twitter.
The Data from Google and ECDC is counted weekly, while Twitter data is real-time and aggregated into daily and also into weekly volumes for comparison purposes. Health data from ECDC corresponds to the consultation rates for influenza-like illness (ILI) for Greece, which are presented by the Flu NEWs Europe system. This data is based on nationally organized sentinel networks of primary care physicians, mostly general practitioners (GPs), covering at least 1-5% of the population in their countries and is calculated per 100,000 population. Twitter data are the messages, called tweets, which were interchanged between people in Greece, concerning influenza for the period considered. We gathered 18,128 tweets in total during the 23-week period, and we used them all despite that there were repeating or forwarded messages ("retweets").
Platform (Hardware and Software). The hardware included the minimum components to run the designed software. The computer, the operational system and an uninterrupted power supply unit (UPS). The total cost of the hardware was about 375€ and it is shown in the Appendix 1. 628 lines of programming code were written in Python, VB.NET and MS-DOS, while the Windows 10 Pro 64-bit was used as the operational system (OS). The Twitter API, which was used was the STREAMING API (instead of the REST API 20 ), along with theTweepy library version.3.6.0 21 in Pyhton 3.5.2 64-bit for Windows 22 . A part of the software module with the code for the applied calculation algorithm can be found in Appendix 2. The entire system was designed to perform the following tasks: 1. To gather and store the messages from Twitter in plain text files, so that the size of these files can be small, as much as possible. The fields of the files were year, month, day, time, week number, text of the message and the username. 2. To calculate daily and weekly sums of the messages and store them in another text file. 3. To estimate the influenza activity by using the written algorithm, store the results into text files and construct a dynamic real-time graph and tables of measurements on the application window form. 4. To self-test if the connection is broken for some reason and re-establish it to minimize the loss of data. 5. To update current estimation and trend every five seconds.
Statistical methods. We used the three-parameter Auto Regression Integrated Moving Average model with exogenous (external) variable (ARIMA(X) (p, q, d)) to estimate the influenza development on a weekly basis. This model is constructed by setting the ILI cases as the dependent variable (predicted variable) and Twitter messages as the independent (external) variable. The ARIMA model can be generally thought as discrete time linear equation with noise as shown in equation: Where: p = order of the autoregressive part of the model, d = order of differencing and q = order of the moving average part of the model X t = the predicted value L = the lag operator If we assume that Y t = (1-L) d X t = Y t + X t -1 + … + X tn , then the model can be expressed as shown in the following equation: In our forecasting model, we include both time series of tweets and the values of influenza, where Y t stands for the predicted value of influenza cases and X t is the value of tweets.
The above model was compared with the parameters for lag 0 of the Google data, as well as for Twitter data, in order to perform the first comparison. The second comparison was made only for Twitter.
The model that it was finally used was an ARIMA (0,1,0), as shown in the following equation: Where Y′ t is the new predicted value of influenza and X t the predictor (values from the internet) for every week (t). We checked the ARIMA(X) model against a customized approximate model. The rationale to do so was that (i) sometimes ARIMA(X) models could be complex enough to be converted into an algorithm and translated into a programming code and (ii) we wish to show that an alternative model could also be used, consisting of specific and consecutive conditions with similar results. This alternative model was simple enough and can be seen in the following set of equations in the following Figure. As seen in the above Fig. 1, specific adjustments were made to decrease or increase the value of the tweets in special cases. The comparisons were evaluated for precision by using the following criteria: • Correlation Coefficient (R) and its statistical significance (p-value two-tailed) • Mean Absolute Percentage Error (MAPE) and • Mean absolute deviation as percentage of the true influenza cases www.nature.com/scientificreports www.nature.com/scientificreports/ As a matter of fact, the presented alternative model is based upon the following three needs: i) to investigate the capability of tracking the weekly and daily trend based on a daily estimation of influenza, since there is no official daily data, ii) to minimize the effect of the repeated messages from Twitter (called Retweets), or the occurrence of messages with the same or similar content, iii) to properly adjust and smoothen the prediction curve in cases of a sudden increasing or descending infection impact or when the influenza activity is very low. This, indeed, happens when the linearity of the data is disturbed, something we have noticed many times when the classic linear models generally fail to deal with this problem.
During the execution of the electronic system and its applications 24/7, we monitored and logged the computer processor (CPU) and memory (RAM) usage and network bandwidth.

Results: Estimation and Prediction
The first comparison of Google and Twitter reveals that Twitter is slightly better in terms of precision. The values from Google Trends showed a correlation (Pearson R) of 0.933, which is very high and significant at the level of <0.05. Its significance was found to be 0.00000000026, while the observed MAPE was 21.358 and the mean deviation was 4.40%). In the following Fig. 2, we can observe the influenza development of true and predicted cases by using Google data.
The blue line represents the true influenza cases, measured by the ECDC, while the red line is the predicted values of the ARIMA(X) model based on the Google data. We can observe that the influenza outbreak occurs in the first weeks of February 2019. The model captures this outbreak by calculating it to 1,248.68 cases when, at this time, the actual influenza activity was observed to be 1,178.40. The difference is very small 70.28 cases and it is only 0.56%. The MAPE was measured 21.358, On the other hand, Twitter data show a similar pattern with some small differences from the Google data, as shown in the following Fig. 3.
In the above Fig. 3, we can see that there is also a strong correlation. The correlation coefficient is larger 0.943 at significance level <0.05. However, the peak prediction is slightly higher 1,298.12. The error of the estimate (MAPE) was lower and it was measured 18.742. Finally, the mean deviation was a little bigger (4.74%).
Comparing the results from the ARIMA(X) model and the alternative model, it was found that the custom model was not as good as the ARIMA(X) but could also be considered as reliable. Below, the following graph shows the results (Fig. 4).
The Correlation Coefficient (Pearson R) is lower, but high enough (0.905) at significance level of <0.05 (0.00000011795), while the error (MAPE) was almost the same (22.614) as in the Google model and the absolute mean deviation is higher than ARIMA models, both from Google and Twitter (5.99%). In the following Table 1 we can see all the statistics, calculated for the three models: From the above Table 1 and the figures, we can make some interesting observations:  www.nature.com/scientificreports www.nature.com/scientificreports/ • All models capture the influenza activity well and estimate the peak at the right time.

Discussion and conclusions
prediction of epidemics. From the above results, it is clearly shown that Internet data can be useful in tracking epidemics. Particularly, Google and Twitter show the potential to estimate and predict influenza, before it is observed in the population. It seems that Twitter has some advantages over Google in our case: it achieves slightly better accuracy, while the monitoring could also be made in smaller time intervals than the official surveillance systems. For instance, European surveillance monitors this epidemic on weekly basis, while, by using Twitter, we are able to monitor the activity and even predict its spread in population daily or within some hours. Google data however is also reliable, but sometimes it was shown 23 that missed some epidemics, such as the 2009 influenza pandemic through Google Flu Trends. In our test case, both data sources were very accurate in estimating the overall development of influenza at the right time and with the right volume. The ARIMA(X) model was proven to be very effective in this process. Different methods have already been proposed for or against ARIMA for health purposes, such the Holt-Winter method 24 or the linear regression methods, e.g. for Ebola 25 . In the latter case, it was shown that a Single Linear Regression (SLR) method resulted in overestimation of the eventual epidemic size, if applied to a real-life data. Scarpino et al. (2019) notice that fundamental limits to outbreak prediction are not clear and, and because of this, an integrative approach is needed regarding models for predicting epidemics 26 . In our test case, the results show that epidemics can be certainly predicted by using ARIMA models or even approximate methods. It may though be sometimes very difficult to apply all ARIMA models within the framework of a real-time computer surveillance system. ARIMA Models with many parameters above zero, e.g. ARIMA (2,3,2) require special and time-consuming techniques in order to be realized into a programming code.
On the other hand, other custom machine learning models, less accurate, can be easily transferred and integrated into machine code. Our approximate model worked correctly and, despite some of the observed deviations from the ARIMA(X) model, derived interesting results, strong correlation and accuracy in predicting the influenza activity, regarding time and size. Undoubtedly, monitoring of severe epidemics must lead to control and prevention. Influenza has many implications and may result to many deaths, if not prevented and treated in time. In Greece alone, 135 deaths were reported and confirmed during the 2018-2019 season 27 .  www.nature.com/scientificreports www.nature.com/scientificreports/ information systems and data. Internet surveillance systems have the potential to assist the traditional systems of monitoring epidemics. However, during the recent years, there is a discussion on the actual role of them. Particularly, whether they should replace the existing surveillance means or may have a supportive role or should be applied as an extension of the current surveillance tools. Big data, on the other hand, which can be gathered through both traditional and internet surveillance systems, require an effective management. As N. Peek, et al. 28 notice, health and biomedicine data are now abundant, while at the same time too complex to be managed through the traditional methods and practice.
Considering that Twitter users communicate to each other with almost 500 million messages every day and over 200 billion messages every year 29 , we believe that the usability of Twitter as a source of data is obvious. Nevertheless, there is no need for a researcher to capture all tweets or all tweet information with the tools provided from the specific APIs created for this purpose. We can limit this in some hours of day, especially for countries larger than Greece. Of course, data handling should be adequate, and the data itself should be properly used for surveillance of serious epidemics in a way that we can produce reliable results.
Regarding the data, the volume that can be retrieved through the internet is very important. Cholera, for instance, is not a common disease and can be tracked in underdeveloped countries, such as in the regions of West Africa. The severity of influenza is also not the same in all countries for a given period. Nevertheless, when the data size is enough, Twitter has the potential to show the trends regarding the spread of influenza in the countries in which it occurs, in a way that enables estimation and early prediction.
Twitter and Google have some major differences regarding the data that can be extracted. The data from Twitter is plenty as Twitter users communicate to each other frequently each day and furthermore is more detailed than these of Google. This may introduce a significant aspect, specifically when we examine smaller areas and not a vast country such as the US or China. Another difference is the time interval. It is more convenient by using Twitter to allocate more data in smaller segments of time, e.g. less than a day.
Limitations. Twitter can be used as a very useful data source, specifically when other resources do not provide good results compared to less popular searches. Influenza is considered a disease with a big interest in the public. The analysis of the tweets could be more detailed in terms of time, since we can set the time frame and gather tweets every minute or hour, every 12 or less hours, every day etc. This, of course, requires big databases and extensive analysis, but we believe that the tools are now present and will be further developed as technology advances. In our test models, the size of the data and the produced electronic files were small, and we did not need to localize the tweets for this purpose, since the particularities of the Greek language make them easily identifiable. On the other hand, it could be impossible to track messages from Twitter, based only on the language. For the English term influenza, for example, it would be necessary to locate the tweets if we want to study a specific area or a country. This is important since location information (geo-tagging) on Twitter is limited to 0.4-0.5% of the total size of the tweets, while 26% or less of the users report their city-level location. This happens because many users do not want their location to be published on-line. In this case, either the data sample must be big enough or other special techniques must be applied to resolve this issue 30,31 . By contrast, localization of Google search volume can be easily made, both from their web application and through APIs such as Pytrends.
Another aspect is the statistical approach. When a known distribution is identified, it certainly simplifies the analysis process, particularly in cases of low correlation data. However, if the parameters of the distributions are quite different for many different periods, there is a possibility that an alternative method would be probably needed. For example, the use of different simple linear or log linear regressions for separate years or the use of non-parametric mixed models. Generally, it's the form of data and the desired outcome that defines the most appropriate method to be used, e.g. when we want to maximize the correlation or to achieve better estimation and predictions for the peak of the occurrence of an infectious disease. Nevertheless, in the case of influenza in Greece, the used models both with Google and Twitter provide a good estimation and prediction of the general pattern of the spread, the peak of the disease and the time it occurs.
Nowadays, many different technologies can be used to retrieve real-time tweets, such as the Twitter REST API or the Twitter STREAMING API. REST APIs have limitations of use called rate limits 32 . These limits are per user or per application and are divided into 15-minute windows. This means that the catching of the tweets stops, and some time is required to start gathering tweets again. The solution to this problem is to use the streaming API, which can be executed for longer-lived connections with Twitter. This second API has no rate limits, but certain read time-outs occur after some hours of the connection and running the streaming. It is possible though to restart the stream after a while. There are also very popular electronic libraries to be used with the streaming API, such as Twython, Tweepy etc. Unfortunately, it is now too difficult to trace back tweets, even though the Twitter API parameters include the since parameter. Despite this fact, it is possible to collect many tweets real-time or from the past tracing the web (https://twitter.com/search-home), but it can be difficult to locate the origin of the tweets. To gain full access to the Twitter API is not allowed by Twitter. There are resellers however 33 , such as DataShift, GNIP Inc. and Topsy, each with different levels of access that can provide large amounts of Twitter data by purchasing them. This access is done by Twitter Firehose.
On the other hand, Google data can be found both from the web application of Google or by using specific API's, such as Pytrends. Google Trends data is an unbiased sample of Google search data and only a percentage of searches are used to compile trends data 34 . This may indicate a small degree of error and the results may be variable to it. The same can be said about the REST API or even more for the streaming API of Twitter. Some API's, such as Pytrends, may result in non-reliable data when search terms contain non-speaking English words with denotation symbols. With that said, along with the sometimes unpredicting behavior of humans (exaggeration or underestimation reactions to some events), internet data may mislead to wrong results. www.nature.com/scientificreports www.nature.com/scientificreports/ conclusions. From the current literature, it seems that in many cases it is possible to conduct estimation and prediction patterns for epidemics from internet data. We confirm this assumption and we have further examined Google and Twitter data and found out that both can produce precise and usable estimation of the influenza development. Although the proposed models were applied to influenza, it would be possible that these models or similar ones could produce similar results for other epidemics as well. Nevertheless, this is an assumption that needs to be confirmed. Regarding internet data, Twitter seems to be slightly better, since it provides better prediction for the entire time-series and has some more advantages over Google. The main criterion of assessing the prediction pattern is certainly the accurate prediction of time and volume when an outbreak is noticed. Nevertheless, additional measurements could be applied, and other evaluation statistical criteria can also be used; such as the mean deviation, or the mean absolute percentage error. As far the ARIMA(X) model is concerned, it worked correctly, as well as the approximate alternative model, despite the less accuracy of prediction. In both cases the correlation between real and predicted influenza cases are found to be very strong. This approximate model could be proved very handy when an ARIMA(X) model is too complex to be translated into machine code.
Although Twitter or even Google data require advanced technical skills to be retrieved and analyzed, there is no doubt that it could be useful. Early tracking of epidemics, on the occasions that this can be achieved, could drive to better alerting, controlling and preventing of them. It is also very promising that ad-hoc electronic systems can be implemented with low costs and could certainly be easily managed as supplementary systems in contrast to the traditional monitoring systems. This means that, if the applied models can achieve precise predictions, we don't really need epidemiological data at all. However, this is partly true, and it is not always validated. This aspect is very important, specifically on some occasions when the forecast is not so good due to the composition or lack of some data or even because of an unexpected reaction to epidemics of people who use search engines or social media.  (9)  For x = m -2 To m dataDay2(i, k, x) = dataDay(i, k, x) * 0.7 Next x End If End If ' adjuestemnt 3: very high values If dataDay(i, k, m) > 1000 Then dataDay2(i, k, m) = dataDay2(i, k, m) / 2 ' adjustment3 too high Next m Next k Next i 'time of last entry TimeMinute = Mid(data(cRow, 1), 15, 2) ' last minute recording TimeStamp = data(cRow, 1) End Sub