Abstract
Research based on investor sentiment in social media has been a hot topic in behavioral finance, and the reliability of investor sentiment mined from social media is a precondition for the reliability of the results of these studies. In the past, scholars have mostly focused on using more reliable tools to track investor sentiment. However, less attention has been paid to another key factor affecting the reliability of investor sentiment on social media: the selection and collection of data. In this study, we systematically investigate the process of data selection and collection for constructing investor sentiment from social media. Our findings suggest that creating a dataset from social media is a process that begins and ends with the research question, in which researchers must overcome various obstacles and ultimately accept an imperfect dataset. Researchers must take a series of steps to approach the best attainable dataset and acknowledge its shortcomings and limitations. We emphasize that the absence of accepted, reliable standards makes it particularly important to follow basic principles. This study is an important reference for social media-based behavioral finance research.
Introduction
The years around 2001 appear to mark the beginning of widespread awareness of the research issues and opportunities presented by sentiment analysis and opinion mining. A number of pioneering studies emerged concerning investor sentiment in social networks (Cardie et al. 2003; S. Das and Chen, 2001; Dave et al. 2003; Dini and Mazzini, 2002; H. Liu et al. 2003; Morinaga et al. 2002; Nasukawa and Yi, 2003; Tateishi, 2001). Subsequently, hundreds of papers were published on this topic in a very short period of time. Pang and Lee (2008) suggested three potential reasons behind this “land grab” phenomenon:
-
The rise of machine learning methods in the fields of natural language processing and information retrieval.
-
The availability of research data brought about by the boom of the World Wide Web and, in particular, the growth of social media.
-
Fascinating application prospects.
The emergence of social media has provided researchers with a new, wealthy, and easily accessible data source about individuals, societies, and even the world at large (Schoen et al. 2013). In the era of big data and artificial intelligence, social media sentiment becomes an important complement to traditional sentiment proxy approaches based on survey research, market indicators, and search engines (Antweiler and Frank, 2004; Da et al. 2015).
The study of social media-based investor sentiment has been one of the focal issues in behavioral finance research in the last two decades, along with the development of the Internet and artificial intelligence (Chen et al. 2014; Cookson and Niessner, 2020; Das et al. 2005; Gilbert and Karahalios, 2010; Liu et al. 2022; Mai et al. 2018). Many scholars have studied the relationship between investor sentiment and stock markets based on sentiment in social media (Liu et al. 2022; Xiong et al. 2017). However, the findings reported by researchers are inconsistent; for example, some researchers report that the sentiments from social media have no predictive ability (Antweiler and Frank, 2004; Tumarkin and Whitelaw, 2001), while others describe weak or strong predictive ability (Liu et al. 2022; Mao et al. 2011). Thus, research on social media sentiment remains an open question (Nguyen and Shirai, 2015).
Mao et al. (2011) pointed out that all of the findings about investor sentiment in social media were based on unique datasets and combinations of tools for tracking sentiment. Scholars have extracted investor sentiment from social media in different and widely varying ways (Long et al. 2021). Yet there are no accepted guidelines for measuring social media investor sentiment, and social media research is still on the path to establishing methodological standards and/or best practices (Mayr and Weller, 2017). This has called into question the reliability of social media investor sentiment (Mao et al. 2011). The reliability of sentiment mining tools has often been considered the most important aspect of ensuring the reliability of investor sentiment in social media in historical studies (Kušen et al. 2017; Q. Li et al. 2018; Loughran and Mcdonald, 2016). Instead, descriptions of data sources and collection are often passed over (Eierle et al. 2022; Gan et al. 2020; Jiao et al. 2020).
As highlighted by See-To and Yang (2017), reliable datasets and the tools for accurate measurement are two of the challenges in investigating social media sentiment. As shown in Fig. 1, ensuring that sentiment mined from social media can be used in finance research needs to be tested against two fundamental questions.
Q1: Is there a relationship between the data collected from social media and the financial market under study?
Q2: Is the researcher parsing real investor sentiment from social media data?
The combination of these two questions ensures that the researcher has obtained appropriate social media data and parsed reliable investor sentiment from it. This is highly consistent with Mao et al. (2011), who highlighted that “findings on social media-based investor sentiment are dependent on a unique, specific combination of datasets and sentiment tracking tools.” As we explained earlier, the reliability of the dataset is the foundation and prerequisite for the reliability of investor sentiment. In related studies, however, scholars have tended to describe social media data in an open-ended and inadequate manner.
In this study, we systematically investigate the process of selecting and collecting data for finance research based on investor sentiment in social media and summarize the basic recommendations that should be followed. We show that the process of creating a dataset is a struggle and a compromise with noise and error, and that all we have to do is keep the error or noise to a low, acceptable level. This survey has significant implications for the measurement of investor sentiment in social media, behavioral finance, and social media mining.
The subsequent sections of this paper are organized as follows: Section 2 describes the methodology, including the core literature sources and the review framework. Section 3 presents the results in terms of data sources, technical framework, and limitations of the dataset. Section 4 discusses research insights and limitations, and the final section concludes.
Methods
Core reference list
To systematically review the methods of constructing datasets from social media and propose a general framework for dataset construction, we conducted a series of key literature searches. To ensure the impact and timeliness of the articles, we performed Google Scholar searches under two conditions: without time restrictions and with publication restricted to 2020 onward:
-
Without time restrictions, we searched using keywords such as “social media” and “investor sentiment” and sorted the results by relevance and citation count. We obtained a total of 420 highly-ranked literature records.
-
Under the condition of limiting the publication year to 2020 to the present, we used the same keywords for the search and sorted the results by relevance and citation count. We successfully retrieved 80 literature records.
After removing duplicate studies, we screened 474 studies. Among these, 291 studies were excluded for not meeting the inclusion criteria, primarily based on the results of title and abstract reviews. Subsequently, the full texts of the remaining 183 studies were reviewed, of which 85 met the inclusion criteria. The main review criteria were as follows:
-
The study uses social media text for sentiment analysis.
-
The study explores the relationship between investor sentiment and financial markets.
-
The paper includes a detailed description of the dataset source, sentiment mining from text, and the construction of sentiment indices.
Based on the above screening method, we compiled and obtained a list of 85 core reference papers, which can be found in Appendix A. Figure 2 illustrates the selection process for the studies included in this review.
Among the 85 core reference papers selected, some studies utilized data sources such as StockTwits, Yahoo! Finance, and Eastmoney BBS, which are specialized electronic communities for investors, rather than general social media platforms. Nonetheless, these platforms still possess the fundamental characteristics of social media and have been widely used in sentiment analysis research related to investor sentiment (Chen et al. 2014; Gloor et al. 2009; Liu et al. 2022).
Review framework
Research question (Data source to answer the research question)
Borgman (2012) pointed out that researchers use various methods to collect data for multiple purposes. When conducting research, it is crucial for researchers to ascertain whether there is sufficient data to support their research questions (Castleberry and Nolen, 2018). Like all types of research, studies based on social media require first defining a clear research question and then obtaining relevant data to address this question (Mayr and Weller, 2017).
Data collection
Once the data source is determined based on the research question, data collection becomes another critical dimension to consider, as highlighted by Castleberry and Nolen (2018). Data collection involves the data time frame, data quantity, and techniques required to address the research question. Mayr and Weller (2017) summarized these issues as the technological framework upon which data collection relies.
Limitations
Researchers first identify the data source based on the research question and then seek a method (i.e., a technological framework) that comes as close as possible to an optimal dataset for data collection (Mayr and Weller, 2017). However, researchers must acknowledge the limitations and deficiencies of the collected data, which may stem from legal, ethical, and technical aspects (Castleberry and Nolen, 2018).
In summary, historical research emphasizes three core dimensions to consider when constructing a dataset: research questions, data collection, and validation of limitations, as shown in Fig. 3. Among these, research questions directly or indirectly determine the technical framework for data sources and collection methods. After obtaining the dataset, it is essential to conduct a validation of limitations based on the research questions to fully understand the data’s constraints and its applicability in addressing the research questions.
Results
Following the investigative framework in Fig. 3, we systematically reviewed the process of selecting and collecting social media sentiment data from three dimensions: data source, technological framework, and limitations. All three dimensions revolve around the research question. In behavioral finance research based on social media investor sentiment, if the real sentiment (\(IS_{Real}\)) in the social media dataset used fails to have an impact on the financial market under study (\(FM\)), then a null result is determined before the research begins. Researchers must carefully examine what data they need to collect and use and be cognizant of the expected impact that particular data collection methods may have on analytical methods and research findings (Mayr and Weller, 2017).
Data source
Social media
Table 1 provides statistics on the sources of traditional social media used to mine investor sentiment (see Appendix A for detailed data). The survey data shows that 35.5% of the researchers used Twitter data, while 2.6% of the studies used Sina Weibo data and 2.6% used Facebook data. Weller (2015) stated that Twitter and Facebook were the social media platforms most frequently studied by academics; our findings differ slightly from Weller’s (2015) but are consistent with current platform rankings. According to the global social media ranking (2022) based on disclosed monthly active user counts, the most popular social media platforms are as follows: Facebook (2.91 billion), Instagram (1.478 billion), Sina Weibo (573 million), Snapchat (557 million), and Twitter (436 million).
However, people mainly use Snapchat and Instagram to follow entertainment and fashion and to share personal content (Phua et al. 2017; Vaterlaus et al. 2016). The global social media platforms suitable for mining investor sentiment therefore include Facebook, Sina Weibo, and Twitter. Twitter and Facebook are social media sites that the US Securities and Exchange Commission (SEC) allows companies to use (Gallu, 2013), so activity on these platforms can plausibly affect a company’s stock performance (L. Liu et al. 2015). Sina Weibo has been the most important microblogging service provider in China since its launch in 2009 (Cheng and Lin, 2013), so studies based on Sina Weibo have mainly focused on the Chinese market.
More than one-third of researchers use Twitter to study social media investor sentiment, making Twitter the most popular medium for investor sentiment mining. Twitter provides an easy-to-use application programming interface (API) that allows researchers to access data with little effort. Users can tag relevant topics (using the “#keyword” syntax) or specific stocks (e.g., “$GOOG” for Google), and these tags can be queried via the generic API (T. Li et al. 2018; See-To and Yang, 2017). The dollar-sign (“$”) filter, also known as the “cashtag,” is a flag investors use to share investment-related opinions on a specific stock; it was promoted by StockTwits, a platform dedicated to sharing stock-related opinions (See-To and Yang, 2017). The Twitter-independent Trendogate app allows users to query Twitter data by location, date, and subject tags (Nisar and Yeung, 2018). Studies have also shown that Twitter-based online sentiment can better capture financial phenomena; for example, Naeem et al. (2021) showed that a Twitter-based sentiment proxy outperformed a Google-based online investor sentiment proxy. In sum, the sufficient amount of data, common API, easy retrieval, and data targeting make Twitter the most popular platform for investor sentiment analysis.
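The cashtag and hashtag filtering described above can be reproduced locally once messages have been collected; a minimal sketch, assuming messages are plain strings (the regex and helper names are illustrative, not any platform’s official API):

```python
import re

# "$GOOG" style cashtags and "#bullish" style hashtags
CASHTAG_RE = re.compile(r"\$([A-Za-z]{1,6})\b")
HASHTAG_RE = re.compile(r"#(\w+)")

def extract_cashtags(message: str) -> list[str]:
    """Return the upper-cased ticker symbols mentioned via $TICKER."""
    return [m.upper() for m in CASHTAG_RE.findall(message)]

def filter_by_ticker(messages: list[str], ticker: str) -> list[str]:
    """Keep only messages that mention the given ticker via a cashtag."""
    ticker = ticker.upper()
    return [msg for msg in messages if ticker in extract_cashtags(msg)]
```

For example, `filter_by_ticker(msgs, "GOOG")` keeps every message containing “$GOOG” (case-insensitively), discarding unrelated chatter before any sentiment parsing takes place.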
Facebook has the largest social media user base (Wilson et al. 2012). Users can write their status updates in a box containing an open-ended question such as “What are you thinking?”, which has allowed scholars to directly assess users’ emotions (Siganos et al. 2017). However, in October 2011, Facebook stopped reporting information from Argentina, Australia, Austria, Canada, Ireland, New Zealand, Singapore, and South Africa (Siganos et al. 2017). In addition, Facebook users use the platform more for self-promotion than Twitter users do (Java et al. 2007). This is probably why Facebook has the largest user base but is not the most popular source of investor sentiment data.
Since the withdrawal of Twitter from the Chinese market in 2009, several local social media companies have launched similar services that have quickly gained popularity. Sina Weibo, launched on August 14, 2009, is the largest microblogging platform in China, with over 100 million messages posted daily (Y. Sun et al. 2020). As of January 2022, Sina Weibo had 573 million users. If the study is limited to the Chinese market and microblog-based sentiment proxies, Sina Weibo seems to be the only option (Y. Sun et al. 2020).
eCommunity
Investor e-communities are another major vehicle for scholars to study investor sentiment on social media. As Table 2 shows, investor e-communities offer a richer selection than everyday social media. StockTwits, Yahoo! Finance, and Eastmoney BBS were used as data sources in 42.1% of the cases, and these three investor e-communities, along with Twitter, are the primary vehicles for scholars to study social media investor sentiment.
Many studies have concluded that electronic communities are better suited as vehicles for investor sentiment than everyday social software (Deng et al. 2018; J. R. Piñeiro-Chousa et al. 2016). Because they are specifically about investing, they result in less noise than datasets collected from more general microblogging services (Oliveira et al. 2013). Nguyen et al. (2015) showed that the primary reason they did not choose Twitter as a source of sentiment was that the messages on Twitter appeared to be more confusing than those on message boards.
StockTwits is a microblogging platform, originally built on top of Twitter, for investors and investment professionals to share information and ideas about the financial markets (StockTwits, 2014). Similar to Twitter messages, each post on StockTwits is limited to 140 characters. Individuals, investors, market professionals, and listed companies can post on it, so StockTwits can be used to “tap into the pulse of the market” (Renault, 2017). StockTwits has the following advantages:
-
Since September 2012, StockTwits has implemented a new feature that allows users to directly select the sentiment they want to express when posting messages on the platform, such as bullish or bearish (Renault, 2020). This allows academics to be spared the hassle of text parsing.
-
StockTwits invests a lot of effort in filtering out non-financial-related messages and spam. High-quality and large-scale text data are provided for investor sentiment mining (A. Sun et al. 2016).
-
Like Twitter, StockTwits offers an easy-to-use API and the ability to locate desired messages by symbols such as the dollar sign (“$”), whose use as a cashtag predates even Twitter’s adoption (See-To and Yang, 2017).
These advantages make it the most researched electronic community for investors (J. R. Piñeiro-Chousa et al. 2016).
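Because StockTwits users can label their own messages bullish or bearish, a sentiment index can be built without text parsing. A common aggregation is the log-ratio bullishness index in the style of Antweiler and Frank (2004); a minimal sketch (function name and label strings are illustrative):

```python
import math
from collections import Counter

def bullishness_index(labels: list[str]) -> float:
    """Antweiler-Frank style index: ln((1 + #bullish) / (1 + #bearish)).

    `labels` holds one user-selected tag per message (e.g. "bullish",
    "bearish"); unlabeled messages are simply ignored here.
    """
    counts = Counter(label.lower() for label in labels)
    return math.log((1 + counts["bullish"]) / (1 + counts["bearish"]))
```

A positive value indicates net bullish sentiment for the period, zero a balanced (or empty) sample; the `1 +` terms keep the index defined when one side has no messages.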
Yahoo! Finance is another electronic community that has received extensive attention from researchers (Gu et al. 2007). Especially until 2010, Yahoo! Finance was the most important vehicle for scholars to study sentiment on electronic message boards (see Appendix A). As shown in Table 2, 14.5% of studies used Yahoo! Finance as a vehicle for social media investor sentiment mining, second only to StockTwits. This is consistent with what Das and Chen (2007) and Sabherwal et al. (2011) reported. In recent years, however, there seems to have been a growing preference among scholars to use StockTwits and Eastmoney BBS to investigate investor sentiment.
Eastmoney BBS is the most visited and influential financial portal in China. It leads the industry in terms of effective browsing time, core web traffic, and average daily visits (Ackert et al. 2016). Eastmoney BBS is characterized by high information content, short texts, and rich emotions (Q. Liu et al. 2023, 2024; Y. Sun et al. 2018). In the absence of Twitter and StockTwits in the Chinese market, Eastmoney BBS has become the main vehicle for studying social media sentiment in the Chinese financial market (Y. Li et al. 2020; Y. Sun et al. 2020).
As shown in Table 2, there are many other electronic communities used for social media investor sentiment mining, but StockTwits, Yahoo! Finance, and Eastmoney BBS are the most important data sources. Some of the data sources listed in Table 2 no longer exist; for example, Reuters 3000 Xtra, an electronic trading platform released by Reuters in 1999 and supported until the end of 2013, was typically used by professional traders and financial analysts in their trading rooms and was replaced by the Eikon platform, first released in 2010.
With regard to the choice of data source, there is also the issue of using a single data source versus a composite data source. In order to study a particular social media platform in depth, it is certainly appropriate to use a single data source approach (Boyd and Crawford, 2012). Sometimes, however, comparisons with other platforms are necessary in order to demonstrate whether the observed phenomena are unique to that data source or a feature shared by multiple data sources. Quan-Haase and Young (2010) showed the value of comparing multi-platform data sources.
Thinking about the ideal dataset should, of course, also include asking whether social media really provides the best data source, or whether there are other, more desirable options, such as experiments, survey data, and content from traditional mass media (Mayr and Weller, 2017). If the answer favors social media, the next question to ponder is that of data sources, i.e., which social media platforms are most relevant to the research question? In the absence of general guidelines, academics often need to answer the question “Why is this data source the best?” themselves. In summary, we highlight the basic principles for describing data sources:
Principle D1:
A clear disclosure of the data sources used for the study and the reasons for their selection can help lend credibility to the work. The basic elements that should be elaborated on include:
-
What data source did we use?
-
What are the characteristics of the data source chosen?
-
Why is this data source most relevant to our work?
Technology framework
Having identified the data source, we need to develop a technical framework to ensure that the ideal dataset can be collected to answer the question under study. Referring to Borgman (2012) and Mayr and Weller (2017), the technology framework should contain three basic dimensions: how the data will be collected, the time frame for data collection, and how much data will be collected.
How to collect data
In our survey, 77.6% of the researchers used Twitter (35.5%), StockTwits (18.4%), Yahoo! Finance (14.5%), or Eastmoney BBS (9.2%) as data sources. We found that the method of data collection was highly dependent on the type of data source, as shown in Fig. 4. Twitter provides an easy-to-use API (Hamraoui and Boubaker, 2022; Java et al. 2007), so most studies used the official API to collect data. There are also platforms for purchasing historical and current public tweets from third parties (See-To and Yang, 2017), for example, Gnip (http://gnip.com/twitter). StockTwits also provides an official data API (Guégan and Renault, 2021), which is used in much the same way as Twitter’s (See-To and Yang, 2017), although it is not as popular. Taken together, official APIs, crawlers, and third-party datasets are the main ways to obtain data from data sources. In addition, a large number of studies do not disclose their data acquisition methods at all.
The specific way in which data are collected determines the content of the data, and both are highly dependent on the question being studied. Researchers using Twitter as a data source can locate the desired content by topic, keyword, or cashtag (Audrino et al. 2020; Rao and Srivastava, 2012b). Sina Weibo researchers, by contrast, lack an official API and convenient content retrieval notation, and often focus on specific user accounts or keyword searches to obtain content that matches the research question (S. Chen et al. 2011; H. Zhou and Pan, 2016).
In addition, many studies require more detail about the data, such as the location of the user who created the data. Vu et al. (2012) wanted to control the market sample in the US in their study, so they geographically focused their Twitter queries on four major cities (New York, Chicago, Los Angeles, and San Francisco). Nisar and Yeung (2018) described how location queries can be used on Twitter. Giannini et al. (2018) note that sentiment in non-local Twitter posts has a very different impact on future reporting than sentiment in local Twitter posts, a phenomenon that Huang et al. (2016) refer to as “hometown bias.” These studies for specific geographic locations also pose new requirements for data collection methods.
Social media texts contain multiple dimensions, and the researcher needs to determine which specific content to use to match the research question: for example, title, author, posting time, number of reads, comments, and body text. Weibo data additionally include details on the followers of the Weibo account (Z. Wang et al. 2020). Which of these elements is needed in the study, and how will they affect the findings? These questions need to be considered and settled.
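The dimensions listed above can be captured in a simple record type before any analysis begins; a minimal sketch, assuming each post has already been parsed from whichever platform is used (field names are illustrative):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Post:
    """One social media message, with the dimensions commonly used in
    investor sentiment studies; which fields matter depends on the question."""
    author: str
    posted_at: datetime
    body: str
    title: Optional[str] = None      # message boards often have titles; tweets do not
    reads: Optional[int] = None      # exposure counts, when the platform reports them
    comments: Optional[int] = None
    followers: Optional[int] = None  # e.g., available for Sina Weibo accounts
```

Making the schema explicit forces the researcher to state which fields the study actually uses and which platform-specific fields (such as follower counts) will be missing for other data sources.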
In summary, once the data sources are identified, the basic method of collection, the specific path, and the detailed content are the fundamental questions facing the researcher. Furthermore, all of this expertise must evolve as social media platforms and their APIs change. Researchers need to reason from the research question toward the ideal solution in order to find the best dataset. In the literature we surveyed, 37.3% of the work did not disclose the basic method of data acquisition, which can confuse the reader. A second principle regarding datasets is presented here:
Principle D2:
Clear disclosure of the basic methods, specific pathways, and content details of data collection helps to make the work more convincing. Its concerns are:
-
In what ways do we acquire data?
-
How do we filter and target the ideal content?
-
What specific content does a basic piece of data contain?
Time frame
Time is a fundamental dimension of data collection, and, as with the choice of data source, the time frame of data collection needs to help answer the question under study. Mayr and Weller (2017) highlighted that we need to consider the time frame in conjunction with the underlying strategy of the data collection setting.
Our survey found that 97.4% of researchers agreed that it was important to disclose the temporal dimension of the data and disclosed the basic data time horizon in their studies (see Appendix A). 55.3% of the work in these studies disclosed the year-month-day interval of the data. For example, Kaminski (2016) used Twitter data from 2013.11.23 to 2014.03.07 to analyze the correlation and causality between bitcoin market indicators and Twitter posts. 34.2% of the work disclosed the year-month interval of the data. For example, Wu et al. (2014) used Sina Finance data between July 2012 and November 2012 to report a strong correlation between stock price volatility trends and stock forum sentiment. 7.9% of the work disclosed the year interval used in the study. For example, Audrino et al. (2020) used a novel and extensive dataset of Twitter, StockTwits, and search engine data from 2012 until 2016 to analyze the impact of sentiment and attention variables on stock market volatility.
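Restricting a collected dataset to the disclosed time frame is straightforward once each record carries a timestamp; a minimal sketch (names are illustrative):

```python
from datetime import date

def within_time_frame(posts, start: date, end: date):
    """Keep posts whose date falls in the disclosed study window [start, end].

    `posts` is any iterable of (posted_on, text) pairs; e.g., for the window
    used by Kaminski (2016) one would pass start=date(2013, 11, 23) and
    end=date(2014, 3, 7).
    """
    return [(d, t) for d, t in posts if start <= d <= end]
```

Filtering against an explicit, disclosed window (rather than “whatever the crawler returned”) is what makes the time-frame statement in a paper verifiable.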
Jungherr et al. (2012) argued that it is not enough to simply disclose the time frame used in the study; the researcher must be clear that the time frame chosen may significantly affect the results of the study and articulate their rationale for choosing this time frame. Kranefuss and Johnson (2021), for example, avoided the impact of market shocks on the findings of the study through the choice of time frame. In addition, the choice of time frame becomes even more necessary for research studies that are set in the context of social events. Some examples are the effect of COVID-19 on investor sentiment (Chen et al. 2020; Huynh et al. 2021) and the effect of political elections on investor sentiment (Oyebode and Orji, 2019; Shaikh, 2019).
If there is no particular reason to choose a specific time frame, then there is little reason not to choose the most recent data. Some studies have shown that investor sentiment in social media is not stable across time (Heston et al. 2010; Wei et al. 2016). Choosing the most recent data helps obtain conclusions that hold at present (Liu et al. 2022); conclusions drawn from data that are too old may once have been correct but may no longer be accurate.
Based on the above analysis, a third basic principle is proposed regarding the dataset.
Principle D3:
-
Scholars should clearly disclose the time frame used for the study and articulate why that time frame is conducive to the response to the research question.
-
If the research question does not have a specific requirement for a time frame, using the most recent data is beneficial in obtaining conclusions that are consistent with the present.
-
If the research question does not specifically require a time frame and the scholar is unable to use the most recent data, it is necessary to make a statement as to why—for example, a data access dilemma.
How much data to collect
Gloor et al. (2009) classified the information domain of the World Wide Web into three categories: the whole web represents the “wisdom of crowds,” blogs represent the “wisdom of experts,” and forums represent the “wisdom of swarms.” Deng et al. (2018) and Wei et al. (2016) argued that sentiment on social media reflects the “wisdom of crowds” or “collective wisdom.” Bollen et al. (2011) argued that although each tweet represents an individual opinion, the aggregate sample should accurately reflect public sentiment. According to the “wisdom of crowds” phenomenon, a large group of people performs better than a small group or minority (Eierle et al. 2022; Nofer and Hinz, 2015). These studies show that research based on social networks needs a sufficiently large amount of data.
Due to the lack of general guidelines, researchers need to find out for themselves the amount of data needed to answer specific research questions (Mayr and Weller, 2017). Ackert et al. (2016) used data from the investor community to examine the predictive power of investor sentiment on investor returns. In the study, Ackert et al. (2016) removed stocks that have a low volume of social media texts, arguing that the smaller volume of data may bias the study sample. Tumarkin et al. (2002) emphasized that only stocks with sufficient social media data should be used in stock market research based on social media sentiment, and Garcia (2013) also showed that a sufficiently large data size gives the study more statistical power. It is clear that sufficient data volume is crucial for social media-based research.
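The low-volume filter applied by Ackert et al. (2016) can be expressed as a small aggregation step; a minimal sketch, assuming each message has already been tagged with its ticker (the threshold value and names are illustrative, not taken from the literature):

```python
from collections import Counter

def liquid_tickers(messages, min_messages: int = 100):
    """Drop stocks with too few social media messages to support inference.

    `messages` is an iterable of (ticker, text) pairs; `min_messages` is a
    study-specific cutoff the researcher must justify and disclose.
    """
    counts = Counter(ticker for ticker, _ in messages)
    return {t for t, n in counts.items() if n >= min_messages}
```

The retained set can then be used to subset both the text corpus and the market data, so that sentiment indices are only computed where the message volume is adequate.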
As shown in Table 3, the median size of the datasets used by the researchers in our survey was 1,109,500 records, while the mean was as high as 23,643,300. There is thus reason to view sentiment mining in social media as a big data application. However, some reports find that the marginal impact of additional data diminishes as the volume grows. For example, Renault (2020) found that prediction accuracy was stable between 250,000 and 500,000 messages and increased by only 0.31 percentage points when the dataset grew from 500,000 to 1 million messages. Larger data volumes mean greater workloads and study difficulty, but no study has reported that larger volumes negatively affect the results.
Based on the above elaboration, the fourth principle regarding datasets is proposed.
Principle D4:
It is necessary for scholars to disclose the volume of data used in the study and to explain why this volume of data is sufficient to answer the question being studied.
Limitations
Scholars start with a research question and think about the ideal data source and data specification needed to answer it. Yet the ideal dataset may not be achievable due to technical (Borra and Rieder, 2014), policy, legal, or ethical constraints (Bishop and Gray, 2017). In crossing these barriers, necessary compromises have to be made (Mayr and Weller, 2017; Nguyen et al. 2015; Y. Sun et al. 2020; Toivonen et al. 2019).
Barriers of law and ethics
In early 2016, a researcher placed the information of 70,000 users of OkCupid, a dating social media platform, in a public database (Metzler et al. 2016). This led to a public outcry and a court case, but it did not change the fact that the data were “already public” (Zimmer, 2020). A central debate is whether social media data are public and, if so, whether this means they can be disseminated without restriction (Bishop and Gray, 2017). Even in the name of open science, social media data access remains subject to legal and ethical scrutiny, and careless data disclosure undermines respect for privacy (Ess and Jones, 2004).
Limitations of policy and technology
Researchers are intrinsically and strongly motivated by the access to and sharing of social media data (Weller, 2015). As data providers, social media platforms have to impose policy and technical restrictions on data access in order to meet legal and ethical requirements (Borra and Rieder, 2014). For example, the Twitter API only supports real-time crawling of tweets as they are posted; it is not possible to retrospectively search for data on past events (so-called “historical” Twitter data) (Weller, 2015). This makes collecting large amounts of Twitter data difficult: gathering a year’s worth of tweets requires a full year of real-time collection. Under current policies, Twitter data requesters can access only 1% of public data through the API (Rao and Srivastava, 2012a; X. Zhang et al. 2011), and researchers have to accept the statistical error this sampling may cause (Phillips and LaPole, 2003; Robertson, 1999). To partially overcome these restrictions, researchers may need to pay additional fees to purchase public tweets (See-To and Yang, 2017).
Crawling techniques are an important way for researchers to obtain social media data, especially from investor communities. Crawlers were used in 65.2% of the disclosed studies based on investor communities (see Appendix A). However, many social networking sites use various methods to protect their data (Z. Zhou, 2014). When crawling data, researchers must therefore limit the crawling speed and address the anti-crawling strategies these sites employ (H. Wang et al. 2018).
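The speed-limiting concern above can be sketched as a minimal "polite crawler" throttle. This is an illustrative sketch, not any surveyed study's implementation; the class name, interval value, and `fetch` callback are all hypothetical.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between successive requests --
    a basic courtesy measure when crawling social media sites."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last_call = None

    def wait(self) -> None:
        # Sleep just long enough that calls are at least
        # `min_interval` seconds apart.
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


def fetch_all(urls, fetch, limiter):
    """Fetch each URL through the supplied `fetch` callable,
    pausing so requests never exceed the limiter's rate."""
    results = []
    for url in urls:
        limiter.wait()
        results.append(fetch(url))
    return results
```

In practice, `fetch` would wrap an HTTP client call, and handling anti-crawling measures (rotating user agents, retry with backoff) would sit alongside the throttle; only the rate-limiting core is shown here.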
Limitations of the collection paths
Many methods of collecting data create dataset-specific biases (Bruns and Stieglitz, 2014). Lorentzen and Nolin (2017) reported on the limitations arising from label-based data collection methods. The core keywords used to locate the data are among the most important factors, directly affecting the content and magnitude of the data obtained (Zheludev et al. 2014). For example, Rao and Srivastava (2012a) collected 4,025,600 Twitter messages from June 2010 to July 2011 to study stock market volatility. However, in another study covering a longer period (June 2010–September 2011), they located only 1,964,000 records (Rao and Srivastava, 2012b).
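The sensitivity of dataset size to keyword choice can be illustrated with a toy filter. This is a deliberately simplified sketch (case-insensitive substring matching over invented messages), not the matching logic of any cited study.

```python
def collect_by_keywords(messages, keywords):
    """Keep only messages containing at least one tracked keyword
    (case-insensitive substring match -- a simplification)."""
    kws = [k.lower() for k in keywords]
    return [m for m in messages if any(k in m.lower() for k in kws)]


# Invented example messages about one stock.
messages = [
    "$AAPL looks bullish today",
    "Selling my Apple shares",
    "Market-wide selloff continues",
]

# A cashtag-only query misses messages that name the company in words,
# so the keyword set directly changes the volume of data collected.
narrow = collect_by_keywords(messages, ["$AAPL"])
broad = collect_by_keywords(messages, ["$AAPL", "Apple"])
```

Here `narrow` captures one message and `broad` captures two; at the scale of millions of posts, such differences in the keyword set compound into the large volume gaps noted above.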
In addition, due to the large amount of data available on social media, in some cases, only a portion of it could be used for the study. The sampling method used to collect the data may introduce additional errors (Abdesslem et al. 2012). In conclusion, researchers must be very aware of the potential limitations of their collection methods and should outline the consequences this may have on the results obtained (Mayr and Weller, 2017).
Discussion
We systematically reviewed the issues of data selection and collection in historical research from the perspectives of data sources, technical frameworks, and limitations. However, it is crucial to recognize that obtaining a perfect dataset to address a research question is challenging. In this section, we further explore the reasons behind this, summarize the insights gained from the study, and discuss its limitations.
Imperfect datasets
Social media data collection is a process that starts with the research question and ends with the research question. We start with a research question, think about the ideal data source and data framework, and then cross the barriers to get the desired data. In the process, we have to make the necessary compromises to get as ideal a dataset as possible. We then have to use this imperfect data to investigate the research question and recognize the biases involved.
Undesirable time frame
One of the signs of a non-desirable time frame is that data for the desired years is not available. Zhang and Skiena (2010) studied the impact of news on company stock performance using a broad data source of blogs and news from 2005 to 2009. However, the data they collected included:
-
data from Twitter in 2009;
-
data from Spinn3r RSS Feeds (a global blog collection) from 2008 to 2009;
-
data from LiveJournal from 2006 to 2008.
It is hard to determine how the different time periods of these data might have affected the study’s results. In the study of Audrino et al. (2020), data from January 2014 and from January and February 2015 were missing, making the study data discontinuous. They had to fill the gaps with neutral sentiment and assess how this imputation affected the results of the study.
The undesirable time frame is also evidenced by the staleness of datasets. Table 4 reports the relative year difference (RYD) between the publication date of each article and the timestamp of the social media dataset it used, for 74 publications. The RYD reflects the recency of the dataset: the 75% quantile of the RYD is 4 years, indicating that a quarter of the studies used data that was 4 years old or older. The maximum RYD is 8 years.
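The RYD metric itself is simple arithmetic and can be computed directly from survey rows; the four (publication year, most recent data year) pairs below are hypothetical, not the actual rows behind Table 4.

```python
from statistics import median


def relative_year_difference(pub_year: int, data_end_year: int) -> int:
    """RYD: gap between an article's publication year and the most
    recent timestamp in the social media dataset it used."""
    return pub_year - data_end_year


# Hypothetical survey rows: (publication year, most recent data year)
survey = [(2020, 2018), (2019, 2015), (2021, 2020), (2018, 2010)]
ryds = [relative_year_difference(p, d) for p, d in survey]
typical_gap = median(ryds)  # summary statistic across the survey
```

Quantiles of the `ryds` list over the full 74-publication survey yield the figures reported in Table 4.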
Undesirable volume of data
We have already discussed that emotions on social media reflect “crowd wisdom” or “collective wisdom” (Deng et al. 2018; Wei et al. 2016). Emotions on social media are a collection of numerous individual emotions. As shown in Table 3, the median size of the data used by researchers in our survey was 1,109,500, while the mean was 23,643,300; a quarter of the studies used fewer than 178,900 messages.
If the amount of data in a study is grossly inadequate due to various barriers in data creation, it can affect the credibility of the work. We have observed many such examples, for example:
-
Chua et al. (2009) used HotCopper ASX data from the first half of 2004 to study the sentiment classification of the investor community. The amount of data used by Chua et al. (2009) is 8307 posts. However, Renault (2020), in a similar study, noted that the classification performance stabilized at a data volume of 250,000 to 500,000 messages.
-
Nofer (2015) studied stock return forecasting using over four years of investor community data between May 2007 and August 2011, comprising only 10,146 text messages.
-
Rizkiana (2021) collected tweet data from Stockbit for five stocks from April 10, 2017 to December 1, 2017. The average number of tweets per day for each stock is less than 2.
-
Hamraoui and Boubaker (2022) collected 1 year of tweets covering 81 stocks, yet the total data volume was only 856 messages.
When the magnitude of the social media data deviates significantly from the concept of “crowd wisdom” or “collective wisdom,” the reader is forced to question the conclusions of the study.
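A quick density check makes the sparsity concrete. The figures 856 messages, 81 stocks, and 1 year come from the Hamraoui and Boubaker (2022) example above; treating the year as 365 days is our assumption.

```python
def avg_messages_per_stock_day(total_messages: int,
                               n_stocks: int,
                               n_days: int) -> float:
    """Average message volume per stock per day -- a rough
    density check for 'crowd wisdom' claims."""
    return total_messages / (n_stocks * n_days)


# 856 tweets spread over 81 stocks and (assumed) 365 days.
density = avg_messages_per_stock_day(856, 81, 365)
```

The result is below 0.03 messages per stock per day, i.e., most stocks see no message at all on most days, which is hard to reconcile with a collective-wisdom interpretation.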
Limitations of third-party datasets
An open-ended study usually starts with a research question. To answer it, the researcher must cross various barriers to obtain the ideal dataset. When researchers anticipate that they cannot cross these barriers, or encounter difficulties along the way, they may consider using an off-the-shelf dataset. As a result, they begin their research not with the problem itself but with an existing dataset (F. Li, 2010). At this point, they are both supported and limited by the nature of the available data sources (Schroeder, 2014).
Other noises
There may also be unexpected obstacles to overcome during data acquisition, and these barriers may introduce additional noise. For example, when Nofer and Hinz (2015) used Twitter data to measure sentiment levels in Germany, they found that only 1% of all tweets were written in German, while 50% were written in English.
They considered it reasonable to include both English and German tweets and translated all tweets into English, a process that introduces additional noise. Moreover, even the same language is used differently in different countries, which can cause errors in later sentiment classification. For example, Gouws et al. (2011) noted a significant difference in word abbreviation behavior between UK and US microtext authors.
As shown in Fig. 5, we start with a research question and go about selecting the ideal data source to answer that question. We then have to develop a data technology framework to obtain the ideal dataset. In the process of collecting data, we need to overcome various obstacles and end up with an imperfect dataset. We have to find a way to approach the best dataset and acknowledge some shortcomings and limitations. It is important to always envision the ideal dataset and then reduce it to the best dataset given the current limitations (Mayr and Weller, 2017).
In summary, we propose the fifth principle regarding datasets.
Principle D5: Due to legal, ethical, and technical barriers, researchers must make compromises during data collection to obtain a dataset that is close to ideal.
They must acknowledge the shortcomings and limitations of the collected dataset and assess the impact of these shortcomings on the results of the study.
Research implications and limitations
This study carries three significant implications. Firstly, we have introduced a comprehensive guiding framework, providing a theoretical foundation for the collection of financial textual data from social media, as shown in Table 5. This framework helps address the inconsistent data quality issues faced by current behavioral finance research. We emphasize that the absence of accepted, reliable standards makes it particularly important to follow basic principles.
Secondly, the reliability of investor sentiment extracted from social media is a crucial condition to ensure the reliability of the research findings. In historical studies, sentiment mining has often been regarded as the most critical step in constructing investor sentiment (Kušen et al. 2017; Q. Li et al. 2018; Loughran and Mcdonald, 2016), while the aspect of data sources has sometimes been overlooked or simplified (Eierle et al. 2022; Gan et al. 2020; Jiao et al. 2020). This study helps fill this gap in current research.
Lastly, there is significant inconsistency in the research conclusions regarding social media sentiment. In this emerging field, it is challenging to disentangle measurement and tool effects from the actual relationship between online sentiment and market value. This makes sentiment surveys based on social media an unresolved open issue (Nguyen and Shirai, 2015). This study holds significant practical implications for enhancing the reliability of investor sentiment measurement and the reproducibility of historical research results.
However, this study has its limitations. Research on investor sentiment based on social media has consistently been a focal point in behavioral finance studies. Given the limited sample size, we were unable to conduct a fully comprehensive survey. However, we have indeed provided a meaningful cross-sectional investigation. Additionally, due to space constraints and research capabilities, some indicators related to investor sentiment were not included in the review scope. For example, there remains insufficient data on the investigation of investor disagreements.
Conclusion
The advent of social media has provided researchers with a new, rich, and easily accessible source of data about individuals, society, and even the world at large (Schoen et al. 2013). Research based on investor sentiment in social media has been one of the focal issues in behavioral finance research for nearly 20 years (Cookson and Niessner, 2020; S. Das et al. 2005; Gilbert and Karahalios, 2010; Mai et al. 2018). It has also been accompanied by researchers’ concerns about the reliability of investor sentiment on social media (Nguyen and Shirai, 2015).
In historical research, sentiment mining is often considered the most important aspect of constructing investor sentiment from social media (Kušen et al. 2017; Q. Li et al. 2018; Loughran and Mcdonald, 2016), while data source selection and data collection are often overlooked (Eierle et al. 2022; Gan et al. 2020; Jiao et al. 2020). Yet reliable data sources are a prerequisite for mining reliable investor sentiment from social media. In this study, we systematically reviewed the process of selecting and collecting social media data for the construction of investor sentiment and provided systematic recommendations on key questions, which are summarized in the table. We show that following these basic recommendations will enhance the reliability of a study and is important for both the replication and comparison of research findings.
Scholars start with a research question and select the ideal data source to answer that question. Scholars need to establish a data technology framework to obtain the ideal dataset. However, there are still various barriers to overcome during the acquisition and collection processes, resulting in an imperfect dataset. From the beginning, this work has been a process of struggle and compromise with noise and error. We show that following these basic recommendations can improve the quality of the dataset, thereby reducing noise in the data source and increasing the reliability and credibility of the study.
The measurement of investor sentiment based on social media presents promising research prospects, yet it also confronts various challenges, with the establishment of data sources being a pivotal step in addressing these challenges. Through a review of historical studies, we have strived to conduct a comprehensive examination to gain a deep understanding of the entire process of measuring investor sentiment based on social media. However, in this rapidly evolving research field, updating and conducting systematic reviews with larger sample sizes remain of paramount importance. Additionally, due to space and research capacity limitations, we were unable to include an investigation related to investor disagreement, which is also a significant factor in shaping investor sentiment. Therefore, future supplementary research can be conducted based on investor disagreement. Finally, large language models (LLMs) have had a significant impact on both the generation of social media data and the construction of investor sentiment, which is a topic for further discussion in the future.
Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request. The core literature list is in Appendix A.
References
Abdesslem FB, Parris I, Henderson T (2012) Reliable online social network data collection. In: Abraham A (ed) Computational social networks: mining and visualization. Springer, p 183–210
Ackert LF, Jiang L, Lee HS, Liu J (2016) Influential investors in online stock forums. Int Rev Financ Anal 45:39–46
Antweiler W, Frank MZ (2004) Is all that talk just noise? The information content of internet stock message boards. J Finance 59(3):1259–1294
Audrino F, Sigrist F, Ballinari D (2020) The impact of sentiment and attention measures on stock market volatility. Int J Forecast 36(2):334–357
Bishop L, Gray D (2017) Ethical challenges of publishing and sharing social media research data. In: Woodfield K (ed) The ethics of online research, vol 2. Emerald Publishing Limited, p 159–187
Bollen J, Mao H, Zeng X (2011) Twitter mood predicts the stock market. J Comput Sci 2(1):1–8
Borgman CL (2012) The conundrum of sharing research data. J Am Soc Inform Sci Technol 63(6):1059–1078
Borra, E, & Rieder, B (2014) Programmed method: developing a toolset for capturing and analyzing tweets. Aslib J Inf Manag. https://doi.org/10.1108/ajim-09-2013-0094
Boyd D, Crawford K (2012) Critical questions for big data: provocations for a cultural, technological, and scholarly phenomenon. Inf Commun Soc 15(5):662–679
Bruns A, Stieglitz S (2014) Twitter data: what do they represent? Inf Technol 56(5):240–245
Cardie, C, Wiebe, J, Wilson, T, & Litman, DJ (2003) Combining low-level and summary representations of opinions for multi-perspective question answering. In: New directions in question answering, 20–27
Castleberry A, Nolen A (2018) Thematic analysis of qualitative research data: is it as easy as it sounds? Curr Pharm Teach Learn 10(6):807–815
Chen C, Liu L, Zhao N (2020) Fear sentiment, uncertainty, and bitcoin price dynamics: the case of COVID-19. Emerg Mark Finance Trade 56(10):2298–2309
Chen H, De P, Hu Y (Jeffrey), Hwang, B-H (2014) Wisdom of crowds: the value of stock opinions transmitted through social media. Rev Financ Stud 27(5):1367–1403
Chen S, Zhang H, Lin M, Lv S (2011) Comparison of microblogging service between Sina Weibo and Twitter. Proc 2011 International Conference on Computer Science and Network Technology 4:2259–2263
Cheng W, Lin J (2013) Investor sentiment and stock market indices in social media. Manag Sci 26(5):111–119
Chua, C, Milosavljevic, M, & Curran, JR (2009) A sentiment detection engine for internet stock message boards. Proceedings of the Australasian Language Technology Association Workshop 2009, 89–93
Cookson JA, Niessner M (2020) Why don’t we agree? Evidence from a social network of investors. J Finance 75(1):173–228
Corea F (2016) Can Twitter proxy the investors’ sentiment? The case for the technology sector. Big Data Res 4:70–74
Da Z, Engelberg J, Gao P (2015) The sum of all FEARS investor sentiment and asset prices. Rev Financ Stud 28(1):1–32
Das S, Chen M (2001) Yahoo! for Amazon: extracting market sentiment from stock message boards. Proceedings of the Asia Pacific Finance Association Annual Conference (APFA) 35:43
Das S, Martínez-Jerez A, Tufano P (2005) eInformation: a clinical study of investor discussion and sentiment. Financ Manag 34(3):103–137
Das SR, Chen MY (2007) Yahoo! for Amazon: sentiment extraction from small talk on the web. Manag Sci 53(9):1375–1388
Dave, K, Lawrence, S, Pennock, DM (2003) Mining the peanut gallery: opinion extraction and semantic classification of product reviews. Proceedings of the 12th International Conference on World Wide Web, 519–528
Deng S, Huang ZJ, Sinha AP, Zhao H (2018) The interaction between microblog sentiment and stock return: An empirical examination. MIS Quarterly 42(3):895–918
Dini, L, Mazzini, G (2002) Opinion classification through information extraction. WIT Transactions on Information and Communication Technologies, 28. https://doi.org/10.2495/DATA020301
Eierle B, Klamer S, Muck M (2022) Does it really pay off for investors to consider information from social media? Int Rev Financ Anal 81:102074
Ess C, Jones S (2004) Ethical decision-making and Internet research: recommendations from the aoir ethics working committee. In: Readings in virtual research ethics: issues and controversies. IGI Global, p 27–44
Fan R, Talavera O, Tran V (2020) Social media bots and stock markets. Eur Financ Manag 26(3):753–777
Gallu J (2013) SEC approves using facebook, twitter for company disclosures. Bloomberg Report, 3
Gan B, Alexeev V, Bird R, Yeung D (2020) Sensitivity to sentiment: news vs social media. Int Rev Financ Anal 67:101390
Garcia D (2013) Sentiment during recessions. J Finance 68(3):1267–1300
Geva T, Zahavi J (2014) Empirical evaluation of an automated intraday stock recommendation system incorporating both market data and textual news. Decis Support Syst 57:212–223
Giannini R, Irvine P, Shu T (2018) Nonlocal disadvantage: an examination of social media sentiment. Rev Asset Pricing Stud 8(2):293–336
Giannini R, Irvine P, Shu T (2019) The convergence and divergence of investors’ opinions around earnings news: evidence from a social network. J Financ Markets 42:94–120
Gilbert, E, Karahalios, K (2010) Widespread worry and the stock market. Fourth International AAAI Conference on Weblogs and Social Media. https://doi.org/10.1609/icwsm.v4i1.14023
Gloor PA, Krauss J, Nann S, Fischbach K, Schoder D (2009) Web science 2.0: identifying trends through semantic social network analysis. 2009 International Conference on Computational Science and Engineering 4:215–222
Gouws S, Metzler D, Cai C, Hovy E (2011) Contextual bearing on linguistic variation in social media. Proceedings of the Workshop on Language in Social Media (LSM 2011), 20–29. https://aclanthology.org/W11-0704
Gu B, Konana P, Liu A, Rajagopalan B, Ghosh, J (2006) Identifying information in stock message boards and its implications for stock market efficiency. Workshop on Information Systems and Economics, Los Angeles, CA. https://www.ideal.ece.utexas.edu/pdfs/151.pdf
Gu B, Konana P, Rajagopalan B, Chen H-WM (2007) Competition among virtual communities and user valuation: the case of investing-related communities. Inf Syst Res 18(1):68–85
Guégan D, Renault T (2021) Does investor sentiment on social media provide robust information for Bitcoin returns predictability? Finance Res Lett 38:101494
Hamraoui I, Boubaker A (2022) Impact of Twitter sentiment on stock price returns. Soc Netw Anal Min 12(1):28
Heston SL, Korajczyk RA, Sadka R (2010) Intraday patterns in the cross-section of stock returns. J Finance 65(4):1369–1407
Huang Y, Qiu H, Wu Z (2016) Local bias in investor attention: evidence from China’s Internet stock message boards. J Empir Finance 38:338–354
Huynh TLD, Foglia M, Nasir MA, Angelini E (2021) Feverish sentiment and global equity markets during the COVID-19 pandemic. J Econ Behav Org 188:1088–1108
Java, A, Song, X, Finin, T, & Tseng, B (2007) Why we twitter: understanding microblogging usage and communities. Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis, 56–65. https://doi.org/10.1145/1348549.1348556
Jiao P, Veiga A, Walther A (2020) Social media, news media and the stock market. J Econ Behav Org 176:63–90
Jungherr A, Jürgens P, Schoen H (2012) Why the Pirate Party won the German election of 2009 or the trouble with predictions: a response to Tumasjan A, Sprenger TO, Sander PG, Welpe IM, “Predicting elections with Twitter: what 140 characters reveal about political sentiment.” Soc Sci Comput Rev 30(2):229–234
Kaminski J (2016) Nowcasting the Bitcoin market with Twitter signals. Preprint at https://doi.org/10.48550/arXiv.1406.7577
Karabulut Y (2013) Can facebook predict stock market activity? AFA 2013 San Diego Meetings Paper. https://doi.org/10.2139/ssrn.2017099
Kim YB, Kim JG, Kim W, Im JH, Kim TH, Kang SJ, Kim CH (2016) Predicting fluctuations in cryptocurrency transactions based on user comments and replies. PLoS ONE 11(8):e0161197
Kraaijeveld O, De Smedt J (2020) The predictive power of public Twitter sentiment for forecasting cryptocurrency prices. J Int Financ Mark Inst Money 65:101188
Kranefuss E, Johnson DKN (2021) Does Twitter strengthen volatility forecasts? Evidence from the S&P 500, DJIA and Twitter sentiment analysis (SSRN Scholarly Paper 3786251). https://doi.org/10.2139/ssrn.3786251
Kušen E, Cascavilla G, Figl K, Conti M, Strembeck M (2017) Identifying emotions in social media: comparison of word-emotion lexicons. 2017 5th International Conference on Future Internet of Things and Cloud Workshops (FiCloudW), 132–137. https://doi.org/10.1109/ficloudw.2017.75
Li F (2010) The information content of forward-looking statements in corporate filings—A naïve Bayesian machine learning approach. J Account Res 48(5):1049–1102
Li J, Bu H, Wu J (2017) Sentiment-aware stock market prediction: a deep learning method. 2017 International Conference on Service Systems and Service Management, 1–6. https://doi.org/10.1109/ICSSSM.2017.7996306
Li Q, Chen Y, Wang J, Chen Y, Chen H (2018) Web media and stock markets: a survey and future directions from a big data perspective. IEEE Transactions on Knowledge and Data Engineering 30(2):381–399
Li T, van Dalen J, van Rees PJ (2018) More than just noise? Examining the information content of stock microblogs on financial markets. J Inf Technol 33(1):50–69
Li Y, Bu H, Li J, Wu J (2020) The role of text-extracted investor sentiment in Chinese stock price prediction with the enhancement of deep learning. Int J Forecast 36(4):1541–1562
Liu H, Lieberman H, Selker, T (2003). A model of textual affect sensing using real-world knowledge. Proceedings of the 8th International Conference on Intelligent User Interfaces, 125–132. https://doi.org/10.1145/604045.604067
Liu L, Wu J, Li P, Li Q (2015) A social-media-based approach to predicting stock comovement. Expert Syst Appl 42(8):3893–3901
Liu Q, Huang M, Zhao L, Lee W-S (2023) The dispositional effects of holidays on investor sentiment: therapeutic and hygienic. J Innov Knowl 8(2):100358
Liu Q, Lee W-S, Huang M, Wu Q (2022) Synergy between stock prices and investor sentiment in social media. Borsa Istanbul Rev. https://doi.org/10.1016/j.bir.2022.09.006
Liu Q, Son H, Lee W-S (2024) The game of lies by stock investors in social media: a study based on city lockdowns in China. Financ Innovation 10(1):65
Long W, Zhao M, Tang Y (2021) Can the Chinese volatility index reflect investor sentiment? Int Rev Financ Anal 73:101612
Lorentzen DG, Nolin J (2017) Approaching completeness: capturing a hashtagged Twitter conversation and its follow-on conversation. Soc Sci Comput Rev 35(2):277–286
Loughran T, Mcdonald B (2016) Textual analysis in accounting and finance: a survey. J Account Res 54(4):1187–1230
Mai F, Bai Q, Shan J, Wang XS, Chiang RH (2015) The impacts of social media on Bitcoin performance
Mai F, Shan Z, Bai Q, Wang X. (Shane), Chiang RHL (2018) How does social media impact Bitcoin value? A test of the silent majority hypothesis. J Manag Inf Syst 35(1):19–52
Mao H, Counts S, Bollen J (2011) Predicting financial markets: comparing survey, news, Twitter and Search Engine data. Preprint at https://doi.org/10.48550/arXiv.1112.1051
Mayr P, Weller K (2017). Think before you collect: setting up a data collection approach for social media studies. In: Solan L, Quan-Haase A (ed) The SAGE handbook of social media research methods. SAGE, p 679
Metzler K, Kim DA, Allum N, Denman A (2016) Who is doing computational social science? Trends in big data research. https://repository.essex.ac.uk/17679/1/compsocsci.pdf
Morinaga S, Yamanishi K, Tateishi K, Fukushima T (2002) Mining product reputations on the web. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 341–349. https://doi.org/10.1145/775047.775098
Naeem MA, Mbarki I, Shahzad SJH (2021) Predictive role of online investor sentiment for cryptocurrency market: evidence from happiness and fears. Int Rev Econ Finance 73:496–514
Nasukawa T, Yi J (2003) Sentiment analysis: capturing favorability using natural language processing. Proceedings of the 2nd International Conference on Knowledge Capture, 70–77. https://doi.org/10.1145/945645.945658
Nguyen TH, Shirai K (2015) Topic modeling based sentiment analysis on social media for stock market prediction. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1354–1364. https://doi.org/10.3115/v1/p15-1131
Nguyen TH, Shirai K, Velcin J (2015) Sentiment analysis on social media for stock movement prediction. Expert Syst Appl 42(24):9603–9611
Nisar TM, Yeung M (2018) Twitter as a tool for forecasting stock market movements: a short-window event study. J Finance Data Sci 4(2):101–119
Nofer M (2015) Are crowds on the internet wiser than experts? – The case of a stock prediction community. In: Nofer M (ed) The value of social media for predicting stock returns: preconditions, instruments and performance analysis. Springer Fachmedien, pp 27–61
Nofer M, Hinz O (2015) Using Twitter to predict the stock market. Business Inf Syst Eng 57(4):229–242
Oliveira N, Cortez P, Areal N (2013) On the predictability of stock market behavior using stocktwits sentiment and posting volume. Portuguese Conf Artificial Intell. https://doi.org/10.1007/978-3-642-40669-0_31
Oliveira N, Cortez P, Areal N (2016) Stock market sentiment lexicon acquisition using microblogging data and statistical measures. Decis Support Syst 85:62–73
Oliveira N, Cortez P, Areal N (2017) The impact of microblogging data for stock market prediction: using Twitter to predict returns, volatility, trading volume and survey sentiment indices. Expert Syst Appl 73:125–144
Oyebode O, Orji R (2019). Social media and sentiment analysis: the Nigeria presidential election 2019. 2019 IEEE 10th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), 0140–0146
Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf Retr 2(1–2):1–135
Phillips CV, LaPole LM (2003) Quantifying errors without random sampling. BMC Med Res Methodol 3(1):1–10
Phua J, Jin SV, Kim JJ (2017) Gratifications of using Facebook, Twitter, Instagram, or Snapchat to follow brands: the moderating effect of social comparison, trust, tie strength, and network homophily on brand identification, brand engagement, brand commitment, and membership intention. Telematics Informatics 34(1):412–424
Piñeiro-Chousa J, López-Cabarcos MÁ, Caby J, Šević A (2021) The influence of investor sentiment on the green bond market. Technol Forecast Soc Change 162:120351
Piñeiro-Chousa JR, López-Cabarcos MÁ, Pérez-Pico AM (2016) Examining the influence of stock market variables on microblogging sentiment. J Business Res 69(6):2087–2092
Quan-Haase A, Young AL (2010) Uses and gratifications of social media: a comparison of Facebook and instant messaging. Bull Sci Technol Soci 30(5):350–361
Rao T, Srivastava S (2012a) Analyzing stock market movements using Twitter sentiment analysis. ASONAM ‘12: Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining, August 2012, 119–123. https://doi.org/10.1109/ASONAM.2012.30
Rao T, Srivastava S (2012b) Using twitter sentiments and search volumes index to predict oil, gold, forex and markets indices. WebSci ‘13: Proceedings of the 5th Annual ACM Web Science Conference, May 2013, 336–345. https://doi.org/10.1145/2464464.2464521
Renault T (2017) Intraday online investor sentiment and return patterns in the U.S. stock market. J Bank Finance 84:25–40
Renault T (2020) Sentiment analysis and machine learning in finance: a comparison of methods and models on one million messages. Digital Finance 2(1):1–13
Rizkiana A (2021) Can investor sentiment in social media be used to make investment decision in stock market? Acad Account Financ Stud J 25(1):1–6
Robertson IG (1999) Spatial and multivariate analysis, random sampling error, and analytical noise: empirical Bayesian methods at Teotihuacan, Mexico. Am Antiq. 64(1):137–152
Sabherwal S, Sarkar SK, Zhang Y (2011) Do internet stock message boards influence trading? Evidence from heavily discussed stocks with no fundamental news. J Bus Finance Account 38(9–10):1209–1237
Schoen H, Gayo-Avello D, Takis Metaxas P, Mustafaraj E, Strohmaier M, Gloor P (2013) The power of prediction with social media. Internet Res 23(5):528–543
Schroeder R (2014) Big Data and the brave new world of social media research. Big Data Soc 1(2):2053951714563194
See-To EWK, Yang Y (2017) Market sentiment dispersion and its effects on stock return and volatility. Electronic Markets 27(3):283–296
Shaikh I (2019) The US presidential election 2012/2016 and investors’ sentiment: the case of CBOE Market Volatility Index. SAGE Open 9(3):2158244019864175
Siganos A, Vagenas-Nanos E, Verwijmeren P (2017) Divergence of sentiment and stock market trading. J Bank Finance 78:130–141
StockTwits (2014) A communications platform for the investing community. http://stocktwits.com/about, 15 June 2014
Sun A, Lachanski M, Fabozzi FJ (2016) Trade the tweet: Social media text mining and sparse matrix factorization for stock market prediction. Int Rev Financ Anal 48:272–281
Sun Y, Fang M, Wang X (2018) A novel stock recommendation system using Guba sentiment analysis. Pers Ubiquitous Comput 22(3):575–587
Sun Y, Liu X, Chen G, Hao Y, Zhang ZJ (2020) How mood affects the stock market: empirical evidence from microblogs. Inf Manag 57(5):103181
Tateishi K (2001) Opinion information retrieval from the Internet. IEICE technical report, 75–82
Toivonen T, Heikinheimo V, Fink C, Hausmann A, Hiippala T, Järv O, Tenkanen H, Di Minin E (2019) Social media data for conservation science: a methodological overview. Biol Conserv 233:298–315
Tsukioka Y, Yanagi J, Takada T (2018) Investor sentiment extracted from internet stock message boards and IPO puzzles. Int Rev Econ Finance 56:205–217
Tumarkin R, From A, Whitelaw R (2002) Internet message board activity and market efficiency: a case study of the internet service sector using RagingBull.com. Financ Mark Inst Instrum. https://doi.org/10.1111/1468-0416.11403
Tumarkin R, Whitelaw RF (2001) News or noise? Internet postings and stock prices. Financ Anal J 57(3):41–51
Vaterlaus JM, Barnett K, Roche C, Young JA (2016) “Snapchat is more personal”: an exploratory study on Snapchat behaviors and young adult interpersonal relationships. Comput Human Behav 62:594–601. https://doi.org/10.1016/j.chb.2016.04.029
Vu TT, Chang S, Ha QT, Collier N (2012) An experiment in integrating sentiment features for tech stock prediction in twitter. Proceedings of the Workshop on Information Extraction and Entity Analytics on Social Media Data, 23–38. https://aclanthology.org/W12-5503
Wang H, Li C, Zhang L, Shi M (2018) Anti-Crawler strategy and distributed crawler based on Hadoop. 2018 IEEE 3rd International Conference on Big Data Analysis (ICBDA), 227–231. https://doi.org/10.1109/ICBDA.2018.8367682
Wang Z, Liu H, Liu W, Wang S (2020) Understanding the power of opinion leaders’ influence on the diffusion process of popular mobile games: travel frog on Sina Weibo. Comput Human Behav 109:106354
Wei W, Mao Y, Wang B (2016) Twitter volume spikes and stock options pricing. Comput Commun 73:271–281
Weller K (2015) Accepting the challenges of social media research. Online Inf Rev 39(3):281–289
Wilson RE, Gosling SD, Graham LT (2012) A review of Facebook research in the social sciences. Perspect Psychol Sci 7(3):203–220
Wu DD, Zheng L, Olson DL (2014) A decision support approach for online stock forum sentiment analysis. IEEE Trans Syst Man Cybern Syst 44(8):1077–1087
Xiong X, Chunchun LUO, Ye Z (2017) Stock BBS and trades: the information content of stock BBS. J Syst Sci Math Sci 37(12):2359
Zhang W, Skiena S (2010) Trading strategies to exploit blog and news sentiment. Fourth International AAAI Conference on Weblogs and Social Media. https://doi.org/10.1609/icwsm.v4i1.14075
Zhang X, Fuehres H, Gloor PA (2011) Predicting stock market indicators through twitter “I hope it is not as bad as I fear”. Procedia Soc Behav Sci 26:55–62
Zhang Y, Swanson PE, Prombutr W (2012) Measuring effects on stock returns of sentiment indexes created from stock message boards. J Financ Res 35(1):79–114
Zheludev I, Smith R, Aste T (2014) When can social media lead financial markets? Sci Rep 4(1):1–12
Zhou H, Pan Q (2016) Information, community, and action on Sina-Weibo: how Chinese philanthropic NGOs use social media. VOLUNTAS 27(5):2433–2457
Zhou Z (2014) Data crawler for Sina Weibo based on Python. J Comput Appl 34(11):3131
Zimmer M (2020) “But the data is already public”: on the ethics of research in Facebook. In: The ethics of information technologies. Routledge, pp 229–241
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT: Ministry of Science and ICT) (No. 2022R1G1A1012604).
Author information
Contributions
Conceptualization, methodology, investigation, data curation, writing—original draft preparation, Qing Liu, writing—review and editing, supervision, project administration, funding acquisition, Hosung Son.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical approval
This article does not contain any studies with human participants performed by any of the authors.
Informed consent
This article does not contain any studies with human participants performed by any of the authors.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Liu, Q., Son, H. Data selection and collection for constructing investor sentiment from social media. Humanit Soc Sci Commun 11, 786 (2024). https://doi.org/10.1057/s41599-024-03316-7