Introduction

The year 2001 appears to mark the beginning of widespread awareness of the research issues and opportunities presented by sentiment analysis and opinion mining. A number of early studies emerged concerning investor sentiment in social networks (Cardie et al. 2003; S. Das and Chen, 2001; Dave et al. 2003; Dini and Mazzini, 2002; H. Liu et al. 2003; Morinaga et al. 2002; Nasukawa and Yi, 2003; Tateishi, 2001). Hundreds of papers were subsequently published on this topic in a very short period of time. Pang and Lee (2008) suggested three potential reasons behind this “land grab” phenomenon:

  • The rise of machine learning methods in the fields of natural language processing and information retrieval.

  • The availability of research data, owing to the boom of the World Wide Web and, in particular, the growth of social media.

  • Fascinating application prospects.

The emergence of social media has provided researchers with a new, rich, and easily accessible data source about individuals, societies, and even the world at large (Schoen et al. 2013). In the era of big data and artificial intelligence, social media sentiment has become an important complement to traditional sentiment proxy approaches based on survey research, market indicators, and search engines (Antweiler and Frank, 2004; Da et al. 2015).

The study of social media-based investor sentiment has been one of the focal issues in behavioral finance research in the last two decades, along with the development of the Internet and artificial intelligence (Chen et al. 2014; Cookson and Niessner, 2020; Das et al. 2005; Gilbert and Karahalios, 2010; Liu et al. 2022; Mai et al. 2018). Many scholars have studied the relationship between investor sentiment and stock markets based on sentiment in social media (Liu et al. 2022; Xiong et al. 2017). However, the findings reported by researchers are inconsistent; for example, some researchers report that the sentiments from social media have no predictive ability (Antweiler and Frank, 2004; Tumarkin and Whitelaw, 2001), while others describe weak or strong predictive ability (Liu et al. 2022; Mao et al. 2011). Thus, research on social media sentiment remains an open question (Nguyen and Shirai, 2015).

Mao et al. (2011) pointed out that all of the findings about investor sentiment in social media were based on unique datasets and combinations of tools for tracking sentiment. Scholars have extracted investor sentiment from social media in widely varying ways (Long et al. 2021). Yet there are no accepted guidelines for measuring social media investor sentiment, and social media research is still on the path to establishing methodological standards and/or best practices (Mayr and Weller, 2017). This has called into question the reliability of social media investor sentiment (Mao et al. 2011). In historical studies, the reliability of sentiment mining tools has often been considered the most important aspect of ensuring the reliability of investor sentiment in social media (Kušen et al. 2017; Q. Li et al. 2018; Loughran and Mcdonald, 2016). By contrast, descriptions of data sources and data collection are often passed over (Eierle et al. 2022; Gan et al. 2020; Jiao et al. 2020).

As highlighted by See-To and Yang (2017), reliable datasets and tools for accurate measurement are two of the challenges in investigating social media sentiment. As shown in Fig. 1, sentiment mined from social media must pass two fundamental questions before it can be used in finance research.

Fig. 1: A framework for reviewing the reliability of social media investor sentiment.

IS is the abbreviation for investor sentiment, and FM for financial markets. Q1 (\(IS_{Real} \leftrightarrow FM\)) describes whether the textual sentiment of the dataset is causally related to the financial market. Q2 (\(IS_{Real} \sim IS_{Measured}\)) describes whether the textual sentiment of the dataset is accurately measured.

Q1: Is there a relationship between the data collected from social media and the financial market under study?

Q2: Is the researcher parsing real investor sentiment from social media data?

The combination of these two questions ensures that the researcher has obtained appropriate social media data and parsed reliable investor sentiment from it. This is highly consistent with Mao et al. (2011), who highlighted that “findings on social media-based investor sentiment are dependent on a unique, specific combination of datasets and sentiment tracking tools.” As we explained earlier, the reliability of the dataset is the foundation and prerequisite for the reliability of investor sentiment. In related studies, however, scholars have tended to describe social media data in an open-ended and inadequate manner.

In this study, we systematically investigate the process of selecting and collecting data for finance research based on investor sentiment in social media and summarize the basic recommendations that should be followed. We show that the process of creating a dataset is a struggle and a compromise with noise and error, and that the goal is to keep that noise and error at a low, acceptable level. This survey has significant implications for the measurement of investor sentiment in social media, for behavioral finance, and for social media mining.

The subsequent sections of this paper are organized as follows: Section 2 describes the methodology, including the core literature sources and the review framework. Section 3 presents the results, describing the main review findings in terms of data sources, technical framework, and limitations of the dataset. Section 4 discusses research insights and limitations, and the final section concludes.

Methods

Core reference list

To systematically review the methods of constructing datasets from social media and propose a general framework for dataset construction, we conducted a series of key literature searches. To ensure the impact and timeliness of the articles, we performed Google Scholar searches under two conditions: without time restrictions, and with publication restricted to 2020 onward:

  • Without time restrictions, we searched using keywords such as “social media” and “investor sentiment” and sorted the results by relevance and citation count, obtaining a total of 420 highly ranked literature records.

  • With the publication year restricted to 2020 onward, we used the same keywords and sorted the results by relevance and citation count, retrieving 80 literature records.

After removing duplicate studies, we screened 474 studies. Among these, 291 studies were excluded for not meeting the inclusion criteria, primarily based on the results of title and abstract reviews. Subsequently, the full texts of the remaining 183 studies were reviewed, of which 85 met the inclusion criteria. The main review criteria were as follows:

  • The study uses social media text for sentiment analysis.

  • The study explores the relationship between investor sentiment and financial markets.

  • The paper includes a detailed description of the dataset source, sentiment mining from text, and the construction of sentiment indices.

Based on the above screening method, we compiled and obtained a list of 85 core reference papers, which can be found in Appendix A. Figure 2 illustrates the selection process for the studies included in this review.

Fig. 2: Flow diagram of studies included in the review.

This figure depicts our process of screening the core literature from 500 references, where n represents the sample size.

Among the 85 core reference papers selected, some studies utilized data sources such as StockTwits, Yahoo! Finance, and Eastmoney BBS, which are specialized electronic communities for investors, rather than general social media platforms. Nonetheless, these platforms still possess the fundamental characteristics of social media and have been widely used in sentiment analysis research related to investor sentiment (Chen et al. 2014; Gloor et al. 2009; Liu et al. 2022).

Review framework

Research question (Data source to answer the research question)

Borgman (2012) pointed out that researchers use various methods to collect data for multiple purposes. When conducting research, it is crucial for researchers to ascertain whether there is sufficient data to support their research questions (Castleberry and Nolen, 2018). Like all types of research, studies based on social media require first defining a clear research question and then obtaining relevant data to address this question (Mayr and Weller, 2017).

Data collection

Once the data source is determined based on the research question, data collection becomes another critical dimension to consider, as highlighted by Castleberry and Nolen (2018). Data collection involves the data time frame, data quantity, and techniques required to address the research question. Mayr and Weller (2017) summarized these issues as the technological framework upon which data collection relies.

Limitations

Researchers first identify the data source based on the research question and then seek a method (i.e., a technological framework) that comes as close as possible to an optimal dataset for data collection (Mayr and Weller, 2017). However, researchers must acknowledge the limitations and deficiencies of the collected data, which may stem from legal, ethical, and technical aspects (Castleberry and Nolen, 2018).

In summary, historical research emphasizes three core dimensions to consider when constructing a dataset: research questions, data collection, and validation of limitations, as shown in Fig. 3. Among these, research questions directly or indirectly determine the technical framework for data sources and collection methods. After obtaining the dataset, it is essential to conduct a validation of limitations based on the research questions to fully understand the data’s constraints and its applicability in addressing the research questions.

Fig. 3: Review framework.

This figure depicts the three main dimensions disclosed in the historical literature for constructing social media datasets. The review framework for this study was constructed based on these three dimensions.

Results

Following the investigative framework in Fig. 3, we systematically reviewed the process of selecting and collecting social media sentiment data along three dimensions: data source, technological framework, and limitations. All three dimensions revolve around the research question. In behavioral finance research based on social media investor sentiment, if the real sentiment (\(IS_{Real}\)) in the dataset has no impact on the financial market in question (\(FM\)), then the study is invalid before it begins. Researchers must carefully examine what data they need to collect and use, and be cognizant of the expected impact that particular data collection methods may have on analytical methods and research findings (Mayr and Weller, 2017).

Data source

Social media

Table 1 provides statistics on the sources of traditional social media used to mine investor sentiment (see Appendix A for detailed data). The survey data show that 35.5% of the researchers used Twitter data, 2.6% used Sina Weibo data, and 2.6% used Facebook data. Weller (2015) stated that Twitter and Facebook were the social media platforms most frequently studied by academics investigating social media sentiment. Our findings differ slightly from Weller’s (2015) but are in line with current usage rankings. According to the global social media ranking (2022) based on disclosed monthly active user counts, the most popular social media platforms are as follows: Facebook (2.91 billion), Instagram (1.478 billion), Sina Weibo (573 million), Snapchat (557 million), and Twitter (436 million).

Table 1 Survey table of everyday social media data sources.

However, people mainly use Snapchat and Instagram to follow entertainment and fashion and to share personal content (Phua et al. 2017; Vaterlaus et al. 2016). The global social media platforms suitable for mining investor sentiment are therefore Facebook, Sina Weibo, and Twitter. Twitter and Facebook are social media sites that the US Securities and Exchange Commission (SEC) allows companies to use for disclosures (Gallu, 2013), so it seems likely that activity on these platforms can affect a company’s stock performance (L. Liu et al. 2015). Sina Weibo has been the most important microblogging service provider in China since its launch in 2009 (Cheng and Lin, 2013), so studies based on it have mainly focused on the Chinese market.

More than one-third of researchers use Twitter to study social media investor sentiment, making Twitter the most popular medium for investor sentiment mining. Twitter provides an easy-to-use application programming interface (API) that allows researchers to access data with little effort. Users can tag relevant topics (using the “#keyword” syntax) or specific stocks (e.g., “$GOOG” for Google), and both can be retrieved via the generic API (T. Li et al. 2018; See-To and Yang, 2017). The dollar-sign (“$”) filter, also known as the “cashtag,” is a flag investors use to share investment-related opinions on a specific stock; it was promoted by StockTwits, a platform dedicated to sharing stock-related opinions (See-To and Yang, 2017). The Twitter-independent Trendogate app allows users to query Twitter data by location, date, and subject tags (Nisar and Yeung, 2018). Studies have also shown that Twitter-based online sentiment can better capture financial phenomena; for example, Naeem et al. (2021) showed that a Twitter-based sentiment proxy outperformed a Google-based online investor sentiment proxy. Taken together, the sufficient amount of data, common API, easy retrieval, and precise targeting make Twitter the most popular platform for investor sentiment analysis.
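To make the retrieval path concrete, the following minimal sketch queries recent tweets for a single cashtag through the third-party tweepy library and the Twitter API v2 recent-search endpoint. It is illustrative only: the bearer token is a placeholder, and availability of the cashtag operator and quotas depends on the API access tier in force at the time.

```python
# Minimal sketch: collect recent tweets for one stock via its cashtag.
# Assumptions: a valid bearer token and an API tier that permits the
# cashtag ($) operator; endpoints and quotas change over time.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # placeholder credential

# English-language tweets mentioning the $GOOG cashtag, excluding retweets.
query = "$GOOG lang:en -is:retweet"

response = client.search_recent_tweets(
    query=query,
    max_results=100,                               # per-request cap for this endpoint
    tweet_fields=["created_at", "public_metrics"],
)

for tweet in response.data or []:                  # response.data is None if no hits
    print(tweet.created_at, tweet.text[:80])
```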

Facebook has the largest social media user base (Wilson et al. 2012). Users can write their status updates in a box containing an open-ended prompt such as “What’s on your mind?”, which has allowed scholars to visually assess users’ emotions (Siganos et al. 2017). However, in October 2011, Facebook stopped reporting information from Argentina, Australia, Austria, Canada, Ireland, New Zealand, Singapore, and South Africa (Siganos et al. 2017). In addition, Facebook users use the platform more for self-promotion than Twitter users do (Java et al. 2007). This is probably why Facebook has the largest user base but is not the most popular source of investor sentiment data.

Since Twitter’s exit from the Chinese market in 2009, several local social media companies have launched similar services that quickly gained popularity. Sina Weibo, launched on August 14, 2009, is the largest microblogging platform in China, with over 100 million messages posted daily (Y. Sun et al. 2020). As of January 2022, Sina Weibo had 573 million users. If a study is limited to the Chinese market and microblog-based sentiment proxies, Sina Weibo seems to be the only option (Y. Sun et al. 2020).

eCommunity

Investor e-communities are another major vehicle for scholars to study investor sentiment on social media. As Table 2 shows, investor e-communities offer a richer selection than everyday social media. StockTwits, Yahoo! Finance, and Eastmoney BBS were used as data sources in 42.1% of the cases, and these three investor e-communities, along with Twitter, are the primary vehicles for scholars studying social media investor sentiment.

Table 2 Survey table of investor eCommunity data sources.

Many studies have concluded that electronic communities are better suited as vehicles for investor sentiment than everyday social media (Deng et al. 2018; J. R. Piñeiro-Chousa et al. 2016). Because they are specifically about investing, they yield less noise than datasets collected from more general microblogging services (Oliveira et al. 2013). Nguyen et al. (2015) stated that the primary reason they did not choose Twitter as a sentiment source was that messages on Twitter appeared to be noisier than those on message boards.

StockTwits is a Twitter-style microblogging platform on which investors and investment professionals share information and ideas about the financial markets (StockTwits, 2014). Similar to Twitter messages, each post on StockTwits is limited to 140 characters. Individuals, investors, market professionals, and listed companies can post on it, so StockTwits can be used to “tap into the pulse of the market” (Renault, 2017). StockTwits has the following advantages:

  • Since September 2012, StockTwits has offered a feature that allows users to directly select the sentiment they want to express when posting messages on the platform, such as bullish or bearish (Renault, 2020). This spares researchers the hassle of parsing text to infer sentiment.

  • StockTwits invests considerable effort in filtering out non-financial messages and spam, providing high-quality, large-scale text data for investor sentiment mining (A. Sun et al. 2016).

  • Like Twitter, StockTwits offers an easy-to-use API and the ability to locate desired messages by symbols such as the dollar sign (“$”), whose use on StockTwits predates even Twitter’s adoption of it (See-To and Yang, 2017).

These advantages make it the most researched electronic community for investors (J. R. Piñeiro-Chousa et al. 2016).
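As an illustration of these access paths, the sketch below pulls recent messages and their user-declared Bullish/Bearish labels from StockTwits’ public symbol stream. The endpoint and JSON layout reflect the API as commonly used in the reviewed period; both are assumptions and may have changed since.

```python
# Minimal sketch: fetch recent StockTwits messages for one symbol and
# read the user-declared sentiment label. The endpoint and response
# layout are assumptions based on the publicly documented API.
import requests

def fetch_stocktwits(symbol: str) -> list[dict]:
    url = f"https://api.stocktwits.com/api/2/streams/symbol/{symbol}.json"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    results = []
    for msg in resp.json().get("messages", []):
        sentiment = (msg.get("entities") or {}).get("sentiment") or {}
        results.append({
            "body": msg.get("body"),
            "created_at": msg.get("created_at"),
            "sentiment": sentiment.get("basic"),   # "Bullish", "Bearish", or None
        })
    return results

# Example: inspect the five most recent messages tagged $AAPL.
for item in fetch_stocktwits("AAPL")[:5]:
    print(item["created_at"], item["sentiment"], (item["body"] or "")[:60])
```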

Yahoo! Finance is another electronic community that has received extensive attention from researchers (Gu et al. 2007). Until about 2010, Yahoo! Finance was the most important vehicle for scholars studying sentiment on electronic message boards (see Appendix A). As shown in Table 2, 14.5% of studies used Yahoo! Finance as a vehicle for social media investor sentiment mining, second only to StockTwits. This is consistent with the reports of Das and Chen (2007) and Sabherwal et al. (2011). In recent years, however, scholars seem to have increasingly preferred StockTwits and Eastmoney BBS for investigating investor sentiment.

Eastmoney BBS is the most visited and influential financial portal in China. It leads the industry in terms of effective browsing time, core web traffic, and average daily visits (Ackert et al. 2016). Eastmoney BBS is characterized by high information content, short texts, and rich emotions (Q. Liu et al. 2023, 2024; Y. Sun et al. 2018). In the absence of Twitter and StockTwits in the Chinese market, Eastmoney BBS has become the main vehicle for studying social media sentiment in the Chinese financial market (Y. Li et al. 2020; Y. Sun et al. 2020).

As shown in Table 2, many other electronic communities are used for social media investor sentiment mining, but StockTwits, Yahoo! Finance, and Eastmoney BBS are among the most important data sources. Some of the data sources listed in Table 2 no longer exist, such as Reuters 3000 Xtra, an electronic trading platform released by Reuters in 1999 and supported until the end of 2013. It was typically used by professional traders and financial analysts in their trading rooms and was replaced by the Eikon platform, first released in 2010.

With regard to the choice of data source, there is also the issue of using a single data source versus a composite data source. In order to study a particular social media platform in depth, it is certainly appropriate to use a single data source approach (Boyd and Crawford, 2012). Sometimes, however, comparisons with other platforms are necessary in order to demonstrate whether the observed phenomena are unique to that data source or a feature shared by multiple data sources. Quan-Haase and Young (2010) showed the value of comparing multi-platform data sources.

Thinking about the ideal dataset should, of course, also include asking whether social media really provides the best data source, or whether there are other, more desirable options, such as experiments, survey data, and content from traditional mass media (Mayr and Weller, 2017). If the answer favors social media, the next question is that of data sources, i.e., which social media platforms are most relevant to the research question? In the absence of general guidelines, academics often need to answer the question “Why is this data source the best?” themselves. In summary, we highlight the basic principles for describing data sources:

Principle D1:

A clear disclosure of the data sources used for the study and the reasons for their selection can help lend credibility to the work. The basic elements that should be elaborated on include:

  • What data source did we use?

  • What are the characteristics of the data source chosen?

  • Why is this data source most relevant to our work?

Technology framework

Having identified the data source, we need to develop a technical framework to ensure that the ideal dataset can be collected to answer the question under study. Referring to Borgman (2012) and Mayr and Weller (2017), the technology framework should contain three basic dimensions: how the data will be collected, the time frame for data collection, and how much data will be collected.

How to collect data

In our survey, 77.6% of the researchers used Twitter (35.5%), StockTwits (18.4%), Yahoo! Finance (14.5%), and Eastmoney BBS (9.2%) as data sources. Our survey found that the method of data collection was highly dependent on the type of data source, as shown in Fig. 4. Twitter provides an easy-to-use API (Hamraoui and Boubaker, 2022; Java et al. 2007), so most studies used the official API to collect data. There are also platforms for purchasing historical and current public tweets from third parties (See-To and Yang, 2017), for example, Gnip (http://gnip.com/twitter). StockTwits also provides an official data API (Guégan and Renault, 2021), which is used in a similar way to Twitter’s (See-To and Yang, 2017), although it is not as popular as the Twitter API. Taken together, official APIs, crawlers, and third-party datasets are the main ways to obtain data from data sources. In addition, a large number of studies do not disclose their data acquisition methods.

Fig. 4: Survey statistics of data collection methods.

The data in this figure comes from the statistics in the “Creation Methods” column in Appendix A.

The specific way in which data are collected determines the content of the collected data, and both are highly dependent on the question being studied. Researchers using Twitter as a data source can locate the desired content by topic, keyword, or cashtag match (Audrino et al. 2020; Rao and Srivastava, 2012b). Sina Weibo researchers, on the other hand, lack the support of an official API and easy content-retrieval notation, and often focus on specific user accounts or keyword searches to obtain content matching the research question (S. Chen et al. 2011; H. Zhou and Pan, 2016).

In addition, many studies require further details about the data, such as the location of the user who created it. Vu et al. (2012) wanted to restrict their market sample to the US, so they geographically focused their Twitter queries on four major cities (New York, Chicago, Los Angeles, and San Francisco). Nisar and Yeung (2018) described how location queries can be used on Twitter. Giannini et al. (2018) noted that sentiment in non-local Twitter posts has a very different impact on future reporting than sentiment in local Twitter posts, a phenomenon that Huang et al. (2016) refer to as “hometown bias.” Such location-specific studies pose additional requirements for data collection methods.

Social media texts contain multiple dimensions, such as the title, author, posting time, number of reads, comments, and body text, and the researcher needs to determine which specific content matches the research question. For Weibo data, additional details on the followers of the account are available (Z. Wang et al. 2020). Which of these elements are needed in the study, and how will they affect the findings? These questions need to be considered and settled.
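As a rough illustration, the record type below gathers the dimensions just listed into one structure; the field names are our own, not any platform’s schema.

```python
# Illustrative record of the content dimensions a collected post may
# carry; field names are hypothetical, not a platform schema.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class SocialMediaPost:
    author: str
    posted_at: datetime
    body: str
    title: Optional[str] = None           # forum posts often have titles; tweets do not
    reads: Optional[int] = None           # view count, where the platform exposes it
    comments: Optional[int] = None
    follower_count: Optional[int] = None  # extra detail available for Weibo accounts
```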

In summary, the basic collection method, the specific path, and the detailed content are the fundamental questions facing the researcher once the data sources are identified. Furthermore, it must be remembered that all this expertise must evolve as social media platforms and their APIs change. Researchers need to reason from the research question toward the ideal solution in order to find the best dataset. In the literature we surveyed, 37.3% of the work did not disclose the basic method of data acquisition, which can confuse the reader. A second principle regarding datasets is presented here:

Principle D2:

Clear disclosure of the basic methods, specific pathways, and content details of data collection helps to make the work more convincing. Its concerns are:

  • In what ways do we acquire data?

  • How do we filter and target the ideal content?

  • What specific content does a basic piece of data contain?

Time frame

Time is a fundamental dimension of data collection, and, as with the choice of data source, the time frame of data collection needs to help answer the question under study. Mayr and Weller (2017) highlighted that we need to consider the time frame in conjunction with the underlying strategy of the data collection setting.

Our survey found that 97.4% of researchers disclosed the basic time horizon of their data, implicitly agreeing that the temporal dimension is important to report (see Appendix A). Of these studies, 55.3% disclosed the year-month-day interval of the data. For example, Kaminski (2016) used Twitter data from November 23, 2013 to March 7, 2014 to analyze the correlation and causality between bitcoin market indicators and Twitter posts. A further 34.2% disclosed the year-month interval. For example, Wu et al. (2014) used Sina Finance data between July 2012 and November 2012 to report a strong correlation between stock price volatility trends and stock forum sentiment. Finally, 7.9% disclosed only the year interval. For example, Audrino et al. (2020) used a novel and extensive dataset of Twitter, StockTwits, and search engine data from 2012 until 2016 to analyze the impact of sentiment and attention variables on stock market volatility.

Jungherr et al. (2012) argued that it is not enough to simply disclose the time frame used in the study; the researcher must be clear that the time frame chosen may significantly affect the results of the study and articulate their rationale for choosing this time frame. Kranefuss and Johnson (2021), for example, avoided the impact of market shocks on the findings of the study through the choice of time frame. In addition, the choice of time frame becomes even more necessary for research studies that are set in the context of social events. Some examples are the effect of COVID-19 on investor sentiment (Chen et al. 2020; Huynh et al. 2021) and the effect of political elections on investor sentiment (Oyebode and Orji, 2019; Shaikh, 2019).

If the research question imposes no particular time frame, there is no reason not to choose the most recent data. Some studies have shown that investor sentiment in social media is not stable across time (Heston et al. 2010; Wei et al. 2016). Choosing the most recent data helps to obtain conclusions that hold at present (Liu et al. 2022). Conclusions drawn from data that are too old may once have been correct but may no longer be accurate.

Based on the above analysis, a third basic principle is proposed regarding the dataset.

Principle D3:

  • Scholars should clearly disclose the time frame used for the study and articulate why that time frame is conducive to the response to the research question.

  • If the research question does not have a specific requirement for a time frame, using the most recent data is beneficial in obtaining conclusions that are consistent with the present.

  • If the research question does not specifically require a time frame and the scholar is unable to use the most recent data, it is necessary to make a statement as to why—for example, a data access dilemma.

How much data to collect

Gloor et al. (2009) classified the information domain of the World Wide Web into three categories: the whole web represents the “wisdom of crowds,” blogs represent the “wisdom of experts,” and forums represent the “wisdom of swarms.” Deng et al. (2018) and Wei et al. (2016) argued that sentiment on social media reflects the “wisdom of crowds” or “collective wisdom.” Bollen et al. (2011) argued that although each tweet represents an individual opinion, the aggregate sample should accurately reflect public sentiment. According to the “wisdom of crowds” phenomenon, a large group of people performs better than a small group or minority (Eierle et al. 2022; Nofer and Hinz, 2015). These studies show that research based on social media needs a sufficiently large amount of data.

Due to the lack of general guidelines, researchers need to find out for themselves the amount of data needed to answer specific research questions (Mayr and Weller, 2017). Ackert et al. (2016) used data from the investor community to examine the predictive power of investor sentiment on investor returns. In the study, Ackert et al. (2016) removed stocks that have a low volume of social media texts, arguing that the smaller volume of data may bias the study sample. Tumarkin et al. (2002) emphasized that only stocks with sufficient social media data should be used in stock market research based on social media sentiment, and Garcia (2013) also showed that a sufficiently large data size gives the study more statistical power. It is clear that sufficient data volume is crucial for social media-based research.

As shown in Table 3, the median size of the datasets used by the researchers in our survey was 1,109,500 records, while the mean was as high as 23,643,300. There is therefore reason to regard sentiment mining in social media as a big data application. However, some reports find that the marginal impact of additional data shrinks as the volume grows. For example, Renault (2020) found that prediction accuracy was stable at 250,000 to 500,000 messages and increased by only 0.31 percentage points when the dataset grew from 500,000 to 1 million messages. Larger data volumes mean greater workloads and practical difficulties, but no study has reported that larger volumes negatively affect the results.

Table 3 Survey statistics of the data volumes used in the reviewed studies.
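Renault’s (2020) plateau can be probed on one’s own corpus with a simple learning-curve experiment: hold out a fixed test set, retrain the same classifier on growing subsets, and observe where accuracy stabilizes. The sketch below assumes a labeled message corpus (`texts`, `labels`) and uses a generic bag-of-words classifier, not Renault’s exact model.

```python
# Learning-curve sketch for judging the marginal value of more data.
# `texts` and `labels` stand in for a labeled message corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def learning_curve(texts, labels, sizes=(10_000, 100_000, 500_000)):
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42)
    for n in sizes:
        n = min(n, len(X_train))
        vec = CountVectorizer(max_features=50_000)
        X_sub = vec.fit_transform(X_train[:n])       # vocabulary refit per subset
        clf = LogisticRegression(max_iter=1000).fit(X_sub, y_train[:n])
        acc = accuracy_score(y_test, clf.predict(vec.transform(X_test)))
        print(f"{n:>9,} messages -> held-out accuracy {acc:.4f}")
```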

Based on the above elaboration, the fourth principle regarding datasets is proposed.

Principle D4:

It is necessary for scholars to disclose the volume of data used in the study and to explain why this volume of data is sufficient to answer the question being studied.

Limitations

Scholars start with a research question and think about the ideal data source and data specification needed to answer it. Yet the ideal dataset may not be achievable due to technical (Borra and Rieder, 2014), policy, legal, or ethical constraints (Bishop and Gray, 2017). In crossing these barriers, necessary compromises have to be made (Mayr and Weller, 2017; Nguyen et al. 2015; Y. Sun et al. 2020; Toivonen et al. 2019).

Barriers of law and ethics

In early 2016, the information of 70,000 users of OkCupid, a dating social media platform, was made available by a researcher in a public database (Metzler et al. 2016). This led to a public outcry and a court case, but the release was defended on the grounds that the data were “already public” (Zimmer, 2020). A central debate is whether social media data are public and, if so, whether this means they can be disseminated without restriction (Bishop and Gray, 2017). Even in the name of open science, social media data access is still subject to legal and ethical scrutiny, and careless data disclosure undermines respect for privacy (Ess and Jones, 2004).

Limitations of policy and technology

Researchers are strongly motivated to access and share social media data (Weller, 2015). As data providers, social media platforms must impose policy and technical restrictions on data access in order to meet legal and ethical requirements (Borra and Rieder, 2014). For example, the Twitter API only supports real-time crawling of tweets as they are posted; it is not possible to retrospectively search for data on past events (so-called “historical” Twitter data) (Weller, 2015). This makes it difficult to collect large amounts of Twitter data: gathering a year’s worth of tweets requires a year of real-time collection. Under current policies, Twitter data requesters can access only 1% of public data (Rao and Srivastava, 2012a; X. Zhang et al. 2011), and researchers have to accept the statistical error this may cause (Phillips and LaPole, 2003; Robertson, 1999). Zhang et al. (2011), for instance, accessed 1% of overall Twitter sentiment through the API. To partially overcome these restrictions, researchers may need to pay additional fees to purchase public tweets (See-To and Yang, 2017).

Crawling techniques are an important way for researchers to obtain social media data, especially from investor communities. Crawlers were used in 65.2% of the disclosed studies based on investor communities (see Appendix A). However, many social networking sites use various methods to protect their data (Z. Zhou, 2014). When crawling, researchers must limit the crawling speed and address the anti-crawling strategies used by these sites (H. Wang et al. 2018).
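A minimal polite-crawling pattern, sketched below, addresses the speed-limiting concern with a fixed delay between requests, an identifying User-Agent, and a single backoff retry on throttling. The forum URL and pagination scheme are hypothetical; real sites may add defenses (login walls, JavaScript rendering, IP blocking) that fall outside this sketch.

```python
# Polite-crawling sketch: fixed delay, identifying User-Agent, and one
# backoff retry on HTTP 429. The pagination scheme is hypothetical.
import time
import requests

HEADERS = {"User-Agent": "academic-research-crawler/1.0 (contact@example.org)"}

def crawl_pages(base_url: str, pages: int, delay: float = 2.0) -> list[str]:
    html_pages = []
    for page in range(1, pages + 1):
        url = f"{base_url}?page={page}"            # hypothetical pagination
        resp = requests.get(url, headers=HEADERS, timeout=10)
        if resp.status_code == 429:                # throttled: back off, retry once
            time.sleep(delay * 10)
            resp = requests.get(url, headers=HEADERS, timeout=10)
        resp.raise_for_status()
        html_pages.append(resp.text)
        time.sleep(delay)                          # rate limit between requests
    return html_pages
```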

Limitations of the collection paths

Many methods of collecting data create dataset-specific biases (Bruns and Stieglitz, 2014). Lorentzen and Nolin (2017) reported on the limitations arising from label-based data collection. The core keywords used to locate the data are among the most important issues, as they directly affect the content and magnitude of the data obtained (Zheludev et al. 2014). For example, Rao and Srivastava (2012a) used Twitter data collected from June 2010 to July 2011 to study stock market volatility, obtaining 4,025,600 records. In another study, however, they collected data over a longer period (June 2010–September 2011) yet located only 1,964,000 records (Rao and Srivastava, 2012b).

In addition, due to the large amount of data available on social media, in some cases, only a portion of it could be used for the study. The sampling method used to collect the data may introduce additional errors (Abdesslem et al. 2012). In conclusion, researchers must be very aware of the potential limitations of their collection methods and should outline the consequences this may have on the results obtained (Mayr and Weller, 2017).

Discussion

We systematically reviewed the issues of data selection and collection in historical research from the perspectives of data sources, technical frameworks, and limitations. However, it is crucial to recognize that obtaining a perfect dataset to address our research questions is challenging. In this section, we will further explore these reasons, summarize the insights gained from the study, and discuss its limitations.

Imperfect datasets

Social media data collection is a process that starts with the research question and ends with the research question. We start with a research question, think about the ideal data source and data framework, and then cross the barriers to get the desired data. In the process, we have to make the necessary compromises to get as ideal a dataset as possible. We then have to use this imperfect data to investigate the research question and recognize the biases involved.

Undesirable time frame

One sign of an undesirable time frame is that data for the desired years are not available. Zhang and Skiena (2010) studied the impact of news on company stock performance using a broad data source of blogs and news from 2005 to 2009. However, the data they collected included:

  • data from Twitter in 2009;

  • data from Spinn3r RSS Feeds (a global blog collection) from 2008 to 2009;

  • data from LiveJournal from 2006 to 2008.

It is hard to determine how the differing time periods of these data may have affected the study’s results. In the study of Audrino et al. (2020), data from January 2014 and from January and February 2015 were missing, making the study data discontinuous. They had to fill the gaps with neutral sentiment and assess how this imputation affected the results.

The undesirable time frame is also evidenced by the staleness of datasets. Table 4 reports the relative year difference (RYD) between the publication date of each article and the time stamp of the social media dataset used, for 74 publications. The RYD reflects the recency of the dataset: the 75% quantile of the RYD is 4 years, indicating that a quarter of the studies used data that were 4 years old or older, and the maximum RYD is 8 years.

Table 4 Relative year-difference statistics for social media data used in the reviewed studies.
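The RYD statistic itself is simple arithmetic: the publication year minus the final year covered by the dataset. The short sketch below computes it and its quantiles with pandas; the values are illustrative, not the surveyed studies.

```python
# RYD sketch: publication year minus the dataset's final year.
# The sample values are illustrative, not the 74 surveyed studies.
import pandas as pd

records = pd.DataFrame({
    "publication_year": [2016, 2014, 2020, 2022],  # hypothetical studies
    "dataset_end_year": [2014, 2012, 2016, 2021],
})
records["RYD"] = records["publication_year"] - records["dataset_end_year"]

print(records["RYD"].describe())                   # mean, quartiles, max
print("75% quantile:", records["RYD"].quantile(0.75))
```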

Undesirable volume of data

We have already discussed that sentiment on social media reflects the “wisdom of crowds” or “collective wisdom” (Deng et al. 2018; Wei et al. 2016): it is an aggregation of numerous individual emotions. As shown in Table 3, the median size of the datasets used by researchers in our survey was 1,109,500 records, the mean was 23,643,300, and only a quarter of the studies used fewer than 178,900.

If the amount of data in a study is grossly inadequate due to various barriers in data creation, it can affect the credibility of the work. We have observed many such examples, for example:

  • Chua et al. (2009) used HotCopper ASX data from the first half of 2004 to study the sentiment classification of the investor community. The amount of data used by Chua et al. (2009) is 8307 posts. However, Renault (2020), in a similar study, noted that the classification performance stabilized at a data volume of 250,000 to 500,000 messages.

  • Nofer (2015) used social media data to study stock return forecasting with more than four years of investor community data, from May 2007 to August 2011, comprising only 10,146 text messages.

  • Rizkiana (2021) collected tweet data from Stockbit for five stocks from April 10, 2017 to December 1, 2017. The average number of tweets per day for each stock is less than 2.

  • Hamraoui and Boubaker (2022) collected one year of tweets for 81 stocks, yet the resulting dataset contained only 856 messages.

When the magnitude of the social media data deviates significantly from the concept of “mass wisdom” or “collective wisdom,” the reader is forced to question the conclusions of the study.

Limitations of third-party datasets

A study usually starts with a research question, and in order to answer it, the researcher must cross various barriers to obtain the ideal dataset. When researchers anticipate being unable to cross these barriers, or encounter difficulties, they may consider using an off-the-shelf dataset. As a result, they begin their research not with the problem itself but with an existing dataset (F. Li, 2010). At that point, they are both supported and limited by the nature of the available data sources (Schroeder, 2014).

Other noises

There may also be unexpected obstacles during data acquisition, and these barriers may introduce additional noise. For example, when Nofer and Hinz (2015) used Twitter data to measure sentiment levels in Germany, they found that only 1% of all tweets were written in German, while 50% were written in English.

They considered it reasonable to include both English and German tweets and translated all tweets into English, a process that generates additional data noise. Moreover, even the same language is written differently in different countries, which can cause errors in later sentiment classification. For example, Gouws et al. (2011) noted a significant difference in word-abbreviation behavior between UK and US microtext authors.
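One way to make such language composition explicit before deciding whether to translate is to bucket posts by detected language, for example with the langdetect package, as sketched below. Note that detection on short, noisy microtexts is itself error-prone and so adds noise of its own.

```python
# Sketch: bucket posts by detected language before translation.
# Detection on short microtexts is itself a noise source.
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make detection deterministic across runs

def split_by_language(posts: list[str]) -> dict[str, list[str]]:
    buckets = {"de": [], "en": [], "other": []}
    for text in posts:
        try:
            lang = detect(text)
        except LangDetectException:        # empty or undecidable text
            lang = "other"
        buckets[lang if lang in buckets else "other"].append(text)
    return buckets
```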

As shown in Fig. 5, we start with a research question and go about selecting the ideal data source to answer that question. We then have to develop a data technology framework to obtain the ideal dataset. In the process of collecting data, we need to overcome various obstacles and end up with an imperfect dataset. We have to find a way to approach the best dataset and acknowledge some shortcomings and limitations. It is important to always envision the ideal dataset and then reduce it to the best dataset given the current limitations (Mayr and Weller, 2017).

Fig. 5: A diagram of the dataset creation process.

In summary, we propose the fifth principle regarding datasets.

Principle D5: Due to legal, ethical, and technical barriers, researchers must make compromises in data collection to obtain a dataset that is as close to the ideal as possible.

We must acknowledge the shortcomings and limitations of the collected datasets and assess the impact of these shortcomings on the results of the study.

Research implications and limitations

This study carries three significant implications. Firstly, we have introduced a comprehensive guiding framework, providing a theoretical foundation for the collection of financial textual data from social media, as shown in Table 5. This framework helps address the inconsistent data quality issues faced by current behavioral finance research. We emphasize that the absence of accepted, reliable standards makes it particularly important to follow basic principles.

Table 5 The basic principles of measuring investor sentiment in social media.

Secondly, the reliability of investor sentiment extracted from social media is a crucial condition to ensure the reliability of the research findings. In historical studies, sentiment mining has often been regarded as the most critical step in constructing investor sentiment (Kušen et al. 2017; Q. Li et al. 2018; Loughran and Mcdonald, 2016), while the aspect of data sources has sometimes been overlooked or simplified (Eierle et al. 2022; Gan et al. 2020; Jiao et al. 2020). This study helps fill this gap in current research.

Lastly, there is significant inconsistency in research conclusions regarding social media sentiment. In this emerging field, it is challenging to disentangle measurement and tool effects from the actual relationship between online sentiment and market value. This makes sentiment measurement based on social media an unresolved open issue (Nguyen and Shirai, 2015). This study holds significant practical implications for enhancing the reliability of investor sentiment measurement and the reproducibility of historical research results.

However, this study has limitations. Research on investor sentiment based on social media has consistently been a focal point of behavioral finance studies, and given the limited sample size, we were unable to conduct a fully comprehensive survey; we have, however, provided a meaningful cross-sectional investigation. Additionally, due to space constraints and research capacity, some indicators related to investor sentiment were not included in the review scope; investor disagreement, for example, remains insufficiently investigated.

Conclusion

The advent of social media has provided researchers with a new, rich, and easily accessible source of data about individuals, society, and even the world at large (Schoen et al. 2013). Research based on investor sentiment in social media has been one of the focal issues in behavioral finance research for nearly 20 years (Cookson and Niessner, 2020; S. Das et al. 2005; Gilbert and Karahalios, 2010; Mai et al. 2018). It has also been accompanied by researchers’ concerns about the reliability of investor sentiment on social media (Nguyen and Shirai, 2015).

In historical research, sentiment mining is often considered the most important aspect of constructing investor sentiment from social media (Kušen et al. 2017; Q. Li et al. 2018; Loughran and Mcdonald, 2016), while data source selection and data collection are often overlooked (Eierle et al. 2022; Gan et al. 2020; Jiao et al. 2020). Yet the reliability of the data source is a prerequisite for mining reliable investor sentiment from social media. In this study, we systematically reviewed the process of selecting and collecting social media data for the construction of investor sentiment and provided systematic recommendations on key questions; the key questions these recommendations address are shown in Table 5. We show that following these basic recommendations will enhance the reliability of a study and matters for both the replication and comparison of research findings.

Scholars start with a research question, select the ideal data source to answer it, and establish a data technology framework to obtain the ideal dataset. However, various barriers must be overcome during acquisition and collection, resulting in an imperfect dataset. From the beginning, this work is a process of struggle and compromise with noise and error. We show that following these basic recommendations can improve the quality of the dataset, thereby reducing noise in the data source and increasing the reliability and credibility of the study.

The measurement of investor sentiment based on social media presents promising research prospects, yet it also confronts various challenges, with the establishment of data sources being a pivotal step in addressing these challenges. Through a review of historical studies, we have strived to conduct a comprehensive examination to gain a deep understanding of the entire process of measuring investor sentiment based on social media. However, in this rapidly evolving research field, updating and conducting systematic reviews with larger sample sizes remain of paramount importance. Additionally, due to space and research capacity limitations, we were unable to include an investigation related to investor disagreement, which is also a significant factor in shaping investor sentiment. Therefore, future supplementary research can be conducted based on investor disagreement. Finally, large language models (LLMs) have had a significant impact on both the generation of social media data and the construction of investor sentiment, which is a topic for further discussion in the future.