Introduction

The COVID-19 pandemic has led to increased xenophobia and racism towards Chinese communities (Lee and Huang, 2021; Zhang, 2021). China, being the first country to report cases of the coronavirus, has been the subject of misinformation regarding the origin of COVID-19, which has fueled a global surge in Sinophobia (Cook et al., 2021). Recent scholarship has examined public sentiment towards China and the Chinese government, with one strand of research analyzing how social media users and media outlets framed China during the early stages of the pandemic. For instance, Cook et al. (2021) found that the pandemic led to a sharp rise in anti-China attitudes in the United States, based on an analysis of English-language tweets. Similarly, Fan and Zhang (2023), analyzing web news on China from media outlets worldwide, found a significant increase in racial slurs targeting China during the early pandemic that persisted even after the World Health Organization warned against misinformation about COVID-19. Meanwhile, there is another strand of literature assessing how Chinese citizens responded to COVID-19 on domestic social media platforms. For example, Lu et al. (2021) found that Sina Weibo users in China were more supportive than critical due to the effective COVID-19 responses by the Chinese government. It is unclear, however, how Chinese language users on Western social media platforms like Twitter discussed COVID-19 and their sentiments towards China, as previous studies have focused either on English social media users or Chinese domestic users in a censored environment.

Recent research has shown that the COVID-19 crisis increased censorship circumvention and access to international news and political content on websites blocked in China (Chang et al., 2022). When individuals seek crisis-related information, they may also come across unrelated information or misinformation that has long been censored by the government. These users may actively engage in social media conversations and increasingly influence public opinion in international society. There is reason to expect positive sentiments toward China among Chinese-language users due to the potential political propaganda by the Chinese government and cyber-nationalists. However, since China has censored most international social media platforms (Hobbs and Roberts, 2018), Chinese-language users on Twitter may represent a very selective group, such as overseas Chinese, residents from Hong Kong, Taiwan, and Singapore, Mainland Chinese with VPN access, and other organizations and bots criticizing or supporting China. Such selectivity may lead to polarized sentiments toward China in Chinese-language conversations. Although scholars have presented evidence on the popularity of Sinophobia among English tweets, little is known about the sentiments within the Chinese language communities and what drives these patterns.

To fill the research gap, this article examines how Chinese language users on Twitter engaged in China-related discussions and the associated sentiments during the early COVID-19 pandemic. Specifically, in the Twitterverse, who were the Chinese language users tweeting about China-related issues during the pandemic? After the COVID-19 outbreak, how did Chinese language users on Twitter discuss the pandemic and China? What were the main public sentiments toward China? Were they targeting the Chinese people or the Chinese government? Did pro-China and anti-China users engage with each other in debate?

To address these questions, we queried the Twitter historical database using keywords related to China, Chinese, the Chinese Communist Party (CCP), and Asians in both simplified and traditional Chinese languages to generate our Chinese Tweets (CNTweets) analytic dataset with over 25 million Tweets by 1.32 million Twitter users between December 2019 and April 2021. We then annotated a training dataset with 10,000 tweets to build a series of deep learning algorithms to classify the sentiment and topics in these tweets by fine-tuning pre-trained Chinese Robustly Optimized Bidirectional Encoder Representations from Transformers with the Whole Word Masking models (Chinese-RoBERTa-wwm-ext) (Devlin et al., 2018; Liu et al., 2019; Cui et al., 2021).

Source of Chinese Tweets

Twitter has been blocked by the Chinese government since 2009 due to information control, so regular Mainland Chinese internet users have to rely on virtual private network (VPN) services to access Twitter (Sullivan, 2012). As a result, Mainland Chinese users on Twitter might be a very selective group of individuals, such as lawyers, journalists, and human rights activists, seeking uncensored information and discussing sensitive topics that are not allowed in China (Song et al., 2015). These anti-Chinese state users are not the only Mainland users who can circumvent the Great Firewall. Previous research also shows the prevalence of pro-Chinese state users, for instance, state-sponsored institutional accounts with free access to Twitter and regular pro-China internet users. China has initiated its own foreign propaganda program, mainly carried out by state-run media enterprises such as China Central Television, China Daily, Global Times, and Xinhua News. Individual pro-state users could be part of the paid 50-cent party, government employees, or other regular nationalistic internet users (Bolsover and Howard, 2019; King et al., 2017). In addition to Mainland Chinese, Chinese language users on Twitter could come from other countries and regions with a population of Chinese language speakers, overseas Chinese, or immigrants of Chinese descent, such as Hong Kong, Macao, Taiwan, Singapore, Thailand, the US, Australia, and Canada. Twitter has also been a battlefield for anti-Chinese state groups with few financial resources, who use Twitter to spread misinformation and disinformation on China and Chinese politics (Bolsover and Howard, 2019). The diversity of Chinese language users on Twitter motivates our first research question pertaining to the sources of Chinese tweets.

RQ1: Who were those Chinese Twitter users mentioning China-related issues during the early pandemic?

Sentiment of Chinese Tweets

A large body of literature has used Twitter to gauge public sentiments and their impacts on political, economic, and social outcomes, such as elections (Tumasjan et al., 2010; Bovet and Makse, 2019; Shmargad, 2022), stock markets (Ranco et al., 2015), and public policies (Flores, 2017). As on other social media platforms (e.g., Weibo, Facebook), public sentiment on Twitter reflects a mix of regular internet users, opinion leaders, organizations, and social bots, and it is part of the algorithmically infused societies co-shaped by algorithmic and human behavior (Wagner et al., 2021).

Prior studies show that both pro- and anti-Chinese state groups have used Twitter as a platform to serve their propaganda purposes (Bolsover and Howard, 2019). However, these studies tend to focus on non-Chinese audiences, and limited research has examined how these groups target Chinese language users on social media platforms. For instance, Bolsover and Howard (2019) find no evidence of pro-Chinese state computational propaganda on Twitter but strong evidence of massive numbers of tweets associated with anti-Chinese state perspectives published in simplified Chinese. This is partly because China’s foreign propaganda has been carried out by traditional state-run media groups, such as China Central Television and Global Times, with massive human and monetary resources. In contrast, anti-Chinese state groups have used computational propaganda to promote and disseminate their messages targeting the Chinese government because of its lower operating costs. Thus, we might observe a lot of anti-Chinese state behavior on Twitter.

For pro-Chinese state groups, prior studies have shown the rise of Chinese digital nationalism (DeLisle et al., 2016; Schneider, 2018). Cyber nationalists, especially young Chinese internet users such as the Little Pink (xiaofenhong, i.e., young Chinese nationalists on the internet) and participants in the Diba Expedition (diba chuzheng, i.e., cyber-nationalism organized by the Diba, a Chinese online community), have defended China and the Chinese government on Western social media platforms without state blessings (Han, 2019; Bi, 2021). These cyber nationalists tend to engage in conversations with their opposing groups instead of posting comments like social bots. Previous research shows that government employees have played an important role in fabricating pro-Chinese messages online (King et al., 2017) and using the click-bait strategy to gain visibility (Lu and Pan, 2021). In addition, in recent years, Beijing has initiated a series of campaigns via soft power messaging and COVID-19 diplomacy to tell China’s story well (Huang and Wang, 2019). Thus, the complexity and dynamics of pro- and anti-Chinese state groups lead us to the second set of research questions.

RQ2: What was the overall pattern of public sentiments during the early pandemic?

RQ3: Who were the main targets of positive and negative sentiments?

RQ4: Were there any conversations between pro-China and anti-China Twitter users?

Content of Chinese Tweets

Twitter has been a public sphere since its founding. After the COVID-19 outbreak, Twitter, like other social media platforms such as Facebook and Weibo, has been one of the major online spaces where individuals seek social support, track government announcements, and monitor the spread of the coronavirus (Lu et al., 2021). We focus on any Chinese tweets mentioning China-related keywords during the pandemic. We expect that Chinese Twitter users, such as overseas students and Chinese immigrants, would use Twitter to share news and seek help when COVID-19 emerged.

Twitter has also been a fierce battlefield for conspiracy theories, hate speech, misinformation, disinformation, and fake news. COVID-19 has led to a global surge of anti-Chinese sentiment (Cook et al., 2021), and racial slurs targeting Asian and Asian American communities, such as “Chinese Virus” and “KungFlu”, have spread widely on Twitter (Ziems et al., 2020). Chinese Americans and overseas Chinese students might use Twitter as a platform to voice their concerns and combat racism and anti-Asian attacks.

The increasing tension between the United States and China, including the trade war, human rights issues pertaining to Xinjiang and Tibet, and the Trump administration’s strict policy on Chinese scientists, might also prompt overseas Chinese users to share concerns about US-China relations, discuss immigration policies, and express anger or fear about the uncertainties of the pandemic. Pro-democracy groups might use Twitter to discuss sensitive topics such as the Xinjiang re-education camps, Uyghurs, and Falungong, while pro-Chinese state users, including state-sponsored organizations and the paid 50-cent party, might use Twitter to promote China’s soft power and boost China’s global image by tweeting about Chinese culture, economic development, tourism, and so on.

The 2019–2020 protest cycles in Hong Kong have drawn great attention from Chinese and global societies. Protesters used Twitter as a platform to diffuse protest information, mobilize resources, and seek solidarity, while pro-Chinese state supporters might also strategically use Twitter for political propaganda by framing the protests as conflict and violence that disrupt the social order and economy and destabilize national security (Zhang et al., 2021). Twitter is also an online space where Chinese state-backed media and nationalists promote the reunification of Mainland China and Taiwan (Chang et al., 2021). Similarly, Taiwan independence supporters might use Twitter to seek support.

The diversity of Chinese Twitter users and the confluence of COVID-19 with other political and social events lead to our third set of research questions.

RQ5: What was the content of these Chinese Tweets during the early pandemic?

RQ6: Was the overall sentiment pattern driven by specific topics during the early pandemic?

RQ7: Was there any variation in topics among different types of Twitter users?

Data and methods

We first introduce how we collected the Chinese tweet dataset (CNTweets). We then describe how we constructed the training dataset used to build deep learning algorithms that classify the sentiments and topics of tweets. Because each research question requires a different method, we then elaborate on the specific method used for each one.

CNTweets data

We used Chinese keywords to retrieve all matched tweets posted in 2019–2021 from Twitter’s historical database using the academic Twitter API. Section 1 in the supporting information (SI) documents the detailed keywords we used in data collection. We collected over 25 million tweets by 1.32 million users mentioning any keywords in simplified or traditional Chinese characters related to China, Chinese, and the CCP. Table 1 shows the descriptive statistics of our Twitter data.

Table 1 Summary of Twitter Data.

Training Data

To extract sentiments and topics from the CNTweets data, we annotated a training dataset of 10,000 tweets and used it to build deep learning algorithms that classify CNTweets. Section 2 in SI documents the detailed process of our training data construction, and here we briefly summarize the major steps. We started with well-known pro- and anti-China Twitter users in the Chinese Twitter community and their followers or following accounts (e.g., PDChinese, dajiyuan). We scraped all their tweets posted in the past 2 years. We also used pro- and anti-China hashtags and keywords (e.g., against CCP) to extract potential tweets that either support or criticize the Chinese government or China. We then used a stratified sampling strategy to select 7000 tweets from these potentially positive or negative tweets targeting China. To add more potentially neutral tweets to our training dataset, we randomly selected 3000 tweets from our CNTweets data, giving a final set of 10,000 tweets for human annotation. We hired both graduate and undergraduate research assistants to manually annotate the sentiment and topics in these tweets. Each tweet was labeled by at least two annotators, and if their labels were inconsistent, one of our authors adjudicated the difference.
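The double-annotation rule described above can be illustrated with a minimal sketch. The tweet ids, labels, and the `author_decision` function are hypothetical placeholders; in the study the adjudication was a manual decision by one of the authors, not a function.

```python
from typing import Callable, Dict, Sequence


def resolve_label(labels: Sequence[str],
                  adjudicate: Callable[[Sequence[str]], str]) -> str:
    """Return the agreed label, or defer to an adjudicator on disagreement."""
    if len(labels) < 2:
        raise ValueError("each tweet needs at least two annotations")
    if len(set(labels)) == 1:
        # All annotators agree, so the shared label is accepted.
        return labels[0]
    return adjudicate(labels)


def author_decision(labels: Sequence[str]) -> str:
    """Hypothetical stand-in for an author's manual adjudication."""
    return "neutral"


# Hypothetical annotations: tweet id -> labels from independent annotators.
annotations: Dict[str, Sequence[str]] = {
    "t1": ["negative", "negative"],
    "t2": ["positive", "neutral"],
}
final = {tid: resolve_label(labs, author_decision)
         for tid, labs in annotations.items()}
```

Only tweets with conflicting labels reach the adjudicator, which keeps the manual workload proportional to the disagreement rate.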

Source of Chinese Twitter users

To tackle the first research question on the sources of Chinese Twitter users, we rely on the partial information provided by the locations that Twitter users self-reported when they signed up for a Twitter account. To extract the major countries and regions, our location analysis first used regular expressions to search for the full names and abbreviations of a country or region and then searched for states, provinces, and major cities within it. For instance, to identify whether a Twitter user is from the United States, we first searched for the United States, U.S., or US, and then incorporated different states, cities, and their abbreviations, such as New York and NY. Readers should interpret the results with caution: the analysis depends on whether Twitter users reveal their true locations, so we might underestimate the total number of users who reported their locations or fail to capture differences between users’ displayed and actual locations due to address changes and false reporting. In addition, we asked our annotators to identify whether a tweet is related to personal opinions, organizations, government announcements, or spam. This allows us to identify whether these tweets are from individual or organizational accounts.
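The two-step location matching can be sketched as follows. The patterns are a heavily abbreviated, hypothetical subset chosen for illustration; the actual keyword lists and matching order used in the study are not given in the text.

```python
import re
from typing import Optional

# Step 1 (assumed patterns): full names and abbreviations of a country or region.
# Abbreviations are matched case-sensitively so that "us" in ordinary text is ignored.
COUNTRY_PATTERNS = {
    "United States": re.compile(r"(?i:\bunited states\b)|\bU\.S\.|\bUSA?\b"),
    "Taiwan": re.compile(r"(?i:\btaiwan\b)|台湾|台灣|臺灣"),
    "Hong Kong": re.compile(r"(?i:\bhong kong\b)|\bHK\b|香港"),
    "Mainland China": re.compile(r"(?i:\bchina\b)|中国|中國"),
}

# Step 2 (assumed patterns): states, provinces, and major cities.
REGION_PATTERNS = {
    "United States": re.compile(r"(?i:\bnew york\b)|\bNY\b"),
    "Mainland China": re.compile(r"(?i:\bbeijing\b|\bshanghai\b)|北京|上海"),
}


def identify_region(location: str) -> Optional[str]:
    """Return the first matching country/region, or None if nothing matches."""
    for patterns in (COUNTRY_PATTERNS, REGION_PATTERNS):
        for name, pattern in patterns.items():
            if pattern.search(location):
                return name
    return None
```

For example, `identify_region("Brooklyn, NY")` falls through the country-level patterns and is resolved by the sub-national ones. Free-text profile locations that match nothing stay unidentified, which is one reason only part of the reported locations can be mapped to a country or region.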

Sentiment of Chinese Tweets

To answer the second question about the overall pattern of public sentiments, we fine-tune the pre-trained Chinese Robustly Optimized BERT Pretraining Approach (RoBERTa) model with Whole Word Masking (Chinese-RoBERTa-wwm-ext) (Liu et al., 2019). Recent developments in natural language processing with deep learning techniques show that BERT has outperformed other state-of-the-art language models (Vaswani et al., 2017; Devlin et al., 2018; Cui et al., 2021). BERT, which stands for Bidirectional Encoder Representations from Transformers, is a language representation model (for more technical details, see Devlin et al., 2018). It is trained on large-scale unlabeled text by randomly masking some of the tokens in the input (i.e., masked language modeling) and taking both the left and right contexts of the input into account (i.e., bidirectional contextual embedding). We used the pre-trained Chinese-RoBERTa-wwm-ext model, fine-tuned the last classification layer, and tuned several hyper-parameters of the model, such as the learning rate and batch size. The fine-tuned RoBERTa models were then used for our specific downstream tasks (i.e., sentiment analysis and topic classification).
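To make the idea of whole word masking concrete, the sketch below masks every character of a selected word together. It is a simplified illustration and not the pre-training procedure of the cited models: it assumes the text has already been segmented into words, treats each Chinese character as one token, and omits the scheme in which some selected tokens are replaced by random tokens or left unchanged. The example words and the masking probability are arbitrary.

```python
import random
from typing import List

MASK = "[MASK]"


def whole_word_mask(words: List[str], mask_prob: float,
                    rng: random.Random) -> List[str]:
    """Mask all characters of a word together; each character is one token."""
    tokens: List[str] = []
    for word in words:
        # One draw per word, so a word is either fully masked or left intact.
        mask_word = rng.random() < mask_prob
        tokens.extend(MASK if mask_word else ch for ch in word)
    return tokens


# Hypothetical pre-segmented input: "中国 / 政府 / 发布 / 公告".
words = ["中国", "政府", "发布", "公告"]
masked = whole_word_mask(words, mask_prob=0.15, rng=random.Random(0))
```

Masking at the word level prevents the model from recovering a masked character simply by looking at the other character of the same word.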

Table 2 shows our accuracy and F1 scores for the sentiment classifier. We classified each tweet’s sentiment toward China into three categories: positive, negative, or neutral. Note that we define China broadly here: it can refer to the nation as a whole, Chinese people, the Chinese central or local government, the CCP, state-sponsored enterprises and organizations, places, and other entities related to China. We also compared the performance of RoBERTa with that of other BERT models, such as MacBERT and Multilingual BERT, on our annotated training datasets, and RoBERTa consistently outperformed the others.
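For readers unfamiliar with these metrics, the sketch below computes accuracy and per-class F1 for a three-class sentiment task on toy labels. The macro average shown at the end is an assumption for illustration; the text does not state which F1 averaging was reported.

```python
from typing import Dict, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Share of predictions that match the human labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_per_class(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """F1 = 2TP / (2TP + FP + FN), computed separately for each label."""
    scores: Dict[str, float] = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy held-out labels, not data from the study.
y_true = ["negative", "negative", "neutral", "positive"]
y_pred = ["negative", "neutral", "neutral", "positive"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Because the sentiment classes are imbalanced, F1 is more informative than accuracy alone: a classifier that always predicts the majority class can reach high accuracy while scoring zero F1 on the minority classes.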

Table 2 Model performance on sentiment and target classification.

To tackle the third research question about the targets of public sentiments, we build RoBERTa models to further discern the target entities: the Chinese people, the Chinese government, or China in general. If a tweet mentions anything related to ordinary Chinese people, we label it as “Chinese people”. If a tweet discusses the political system in China, we label it as “Chinese government”. Examples of entities in this category include the central or local Chinese government, the CCP, general politics in China, police departments, state media, state-sponsored companies, major political figures in China, and Beijing or Zhongnanhai when they are used to refer to the government. Sometimes people refer to the Chinese government with expressions such as “China” and “authoritarian regime” without using any specific term related to the government. In these cases, our annotators had to use their own judgment to identify the targets and label the tweets. If a tweet discusses China but cannot be categorized as “Chinese people” or “Chinese government”, we label it as “China in general”, for example, tweets about festivals and traveling. Table 2 shows our accuracy and F1 scores for the target classifier. It is noteworthy that a tweet can contain multiple entities, as we trained three separate classifiers to identify the target entities.

To answer the fourth research question on the dynamics between pro- and anti-Chinese state groups, we rely on social network analysis techniques. We used the conversation_id from Twitter to construct a bipartite conversation network based on whether pro- and anti-China users engaged in the same conversations. Twitter assigns a unique conversation_id to each reply thread, and this conversation_id matches the id of the original tweet that started the conversation. Thus, a conversation contains all replies to a given original tweet, as well as the replies to those replies. This API feature allows scholars to retrieve and reconstruct an entire conversation thread and understand how conversations and ideas evolve on Twitter. In addition to the conversation network, we also conducted a retweet network analysis (see Section 5 in SI), and the results are consistent. We prefer the conversation network over the retweet network in the main text because a conversation with a series of replies signals more in-depth engagement than a simple retweet.
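A minimal sketch of this construction is given below, using only the standard library. The records and user names are hypothetical, and the projection onto a user-by-user network weighted by shared conversations is one common choice that the text does not spell out.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

# Hypothetical records: (conversation_id, user_id) for each tweet or reply.
tweets = [
    ("c1", "alice"), ("c1", "bob"),
    ("c2", "alice"), ("c2", "carol"), ("c2", "bob"),
    ("c3", "dave"),
]


def build_bipartite(records: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each conversation to the set of users who took part in it."""
    conv_users: Dict[str, Set[str]] = defaultdict(set)
    for conversation_id, user_id in records:
        conv_users[conversation_id].add(user_id)
    return dict(conv_users)


def project_users(conv_users: Dict[str, Set[str]]) -> Dict[Tuple[str, str], int]:
    """Count, for each pair of users, the conversations they share."""
    weights: Dict[Tuple[str, str], int] = defaultdict(int)
    for users in conv_users.values():
        for u, v in combinations(sorted(users), 2):
            weights[(u, v)] += 1
    return dict(weights)


conversations = build_bipartite(tweets)
edges = project_users(conversations)
```

In this toy data, the conversation `c3` has a single participant and therefore contributes no edge, which mirrors tweets that received no replies.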

Content of Chinese Tweets

To address the fifth research question about the content of Chinese tweets, we train a series of classifiers to identify whether a tweet is related to COVID-19, politics, economy, culture, religion, and the US. These topics were selected based on the geopolitics centering on China during the early pandemic. For instance, during the early pandemic, the US-China trade war, Taiwan issues, Hong Kong protests, and the US presidential election drew great attention from journalists, policymakers, and the public. Thus, for politics, we further discerned US politics, Hong Kong politics, and Taiwan politics. Our annotators were trained to assign these topic labels to each tweet in our training dataset. Table 3 shows our accuracy and F1 scores for each classifier. Because some classifiers have a lower F1 score, we also supplement our topic classification results with structural topic modeling as a robustness check (Roberts et al., 2019). The structural topic model, an unsupervised text analysis tool, has been used to retrieve information from large-scale textual data. Its advantage over the conventional latent Dirichlet allocation (LDA) model is that it allows researchers to flexibly estimate how document-level metadata shapes topic prevalence (Blei et al., 2003; Roberts et al., 2014). The most straightforward way to understand topic models is to see each document as a mixture of themes or topics governed by some prior distribution and each theme as a distribution over words in a fixed vocabulary; topic modeling then finds the two sets of parameters that best fit the observed data. When estimating topic models, researchers need to pre-define the number of topics in the documents. We chose the structural topic model over others because of its capacity to add metadata such as timing (e.g., month) into the model estimation process, and we ran a series of structural topic models with different numbers of topics (e.g., K = 30, 50, and 100). In the main text, we only present the model with K = 30.
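The description above can be written compactly. The notation is introduced here for illustration and follows the standard presentations of LDA and the structural topic model: $\theta_d$ is the topic mixture of document $d$, $\beta_k$ is the word distribution of topic $k$, $z_{d,n}$ is the topic of the $n$-th word $w_{d,n}$, and $X_d$ holds document-level metadata such as the month.

```latex
\begin{align}
  % LDA: topic proportions drawn from a Dirichlet prior
  \theta_d &\sim \mathrm{Dirichlet}(\alpha), \\
  z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d), \\
  w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Multinomial}(\beta_{z_{d,n}}), \\
  % Structural topic model: metadata X_d shifts topic prevalence
  \theta_d \mid X_d &\sim \mathrm{LogisticNormal}(X_d \gamma, \Sigma).
\end{align}
```

The last line is what distinguishes the structural topic model for topic prevalence: the Dirichlet prior is replaced by a logistic-normal distribution whose mean depends on the metadata, with $K$ topics fixed in advance.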

Table 3 Model performance on topic classification.

To address the sixth question on the relationship between topics and sentiments, we run regression models with month fixed effects to test whether topics such as politics, economy, culture, religion, COVID-19, and US-China relations were driving the overall sentiment pattern toward China.
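The text does not give the exact specification. One plausible form, stated here purely as an assumption, regresses a sentiment indicator for tweet $i$ on topic indicators and a month fixed effect:

```latex
\begin{equation}
  S_i = \alpha + \sum_{k} \beta_k \, T_{ik} + \gamma_{m(i)} + \varepsilon_i
\end{equation}
```

Here $S_i$ is a hypothetical outcome (for example, 1 if tweet $i$ is classified as negative toward China and 0 otherwise), $T_{ik}$ equals 1 if tweet $i$ is classified into topic $k$, $\gamma_{m(i)}$ is the fixed effect for the month in which the tweet was posted, and $\varepsilon_i$ is the error term. Under this form, $\beta_k$ captures how sentiment differs for tweets on topic $k$ after removing month-specific shocks common to all tweets.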

To address the seventh question on the topic variation by pro- and anti-China users, we focus on two types of accounts that either support or oppose China. We analyzed the differences in the content of their posted tweets in our CNTweets database.

Results

The sources and types of Chinese Twitter users

To recapitulate, our first research question asks who mentioned China-related issues during the early pandemic, so we begin by describing the overall pattern of who created these Chinese tweets. The descriptive analysis shows that 1% of Twitter users generated 62% of all Chinese tweets in our database from December 2019 to April 2021. Notably, 10% of Twitter users contributed 90% of all Chinese tweets in our CNTweets database. Thus, in the Twitterverse of Chinese language users, the majority of Chinese tweets targeting China, Chinese, the CCP, and Asians in either a positive or negative direction were driven by a handful of Twitter users (around 13 thousand).
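The concentration statistic can be computed as in the sketch below. The per-user tweet counts are toy values constructed to reproduce shares of the same size as those reported; they are not the study's data.

```python
from typing import Sequence


def top_share(tweet_counts: Sequence[int], top_fraction: float) -> float:
    """Share of all tweets posted by the most active `top_fraction` of users."""
    counts = sorted(tweet_counts, reverse=True)
    k = max(1, round(len(counts) * top_fraction))
    return sum(counts[:k]) / sum(counts)


# Toy data: 100 users posting 1,000 tweets in total.
toy_counts = [620] + [32] + [31] * 8 + [2] * 10 + [1] * 80

share_top_1pct = top_share(toy_counts, 0.01)
share_top_10pct = top_share(toy_counts, 0.10)
```

Applied to 1.32 million users, the top 1% corresponds to roughly 13 thousand accounts, which is the "handful" of users referred to above.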

We then ran a geospatial analysis of Chinese Twitter users’ self-reported locations. The majority of Chinese language Twitter users reported a location in Mainland China, the US, Taiwan, or Hong Kong. Among 1.32 million Twitter users in our dataset, 0.58 million (43.83%) of them self-reported a location on their public profiles. Among those who reported certain information in the location part of the profile, we were able to identify 0.33 million (58%) users’ countries/regions (e.g., Europe, Singapore, Indonesia, Japan). Among those users with identified countries/regions, the majority reported a location in Mainland China (31.62%), the US (18.09%), Taiwan (8.95%), and Hong Kong (8.59%). This is reasonable as these countries or regions contain a large population of individuals who speak Chinese.

We also find that the majority of Chinese language tweets were associated with personal opinions, followed by news content. We trained a RoBERTa classifier to discern the types of these tweets. Each tweet was classified into personal opinions (i.e., any personal expression such as personal opinions, comments, discussion, or emotions about any topic), news content (e.g., news related to COVID-19, China, the US, or other countries), government or any other institutions’ announcements (e.g., announcements by government officials and World Health Organization’s health advice), advertisements and spam, and others. We find that 68.4% of tweets were related to personal opinions, 27.6% were associated with news content, 0.71% of tweets were related to governments’ or other institutions’ announcements, and 2.16% were advertisements and spam. This suggests that Chinese language users have used Twitter as a public space to express opinions towards China instead of retweeting news-like content or government announcements.

The overall sentiments and main targets

The second research question asks about the overall pattern of public sentiments during the early pandemic. Our RoBERTa sentiment classifier shows that the sentiments in the Chinese tweets were predominantly negative toward China. As shown in Table 4, among the roughly 25 million tweets in our CNTweets database, 15.74 million were classified as negative toward China, 5.54 million as neutral, and only 4.02 million as positive. Negative, neutral, and positive tweets thus accounted for 62%, 22%, and 16% of the total, respectively, during the early pandemic.
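The percentage shares follow directly from the reported counts, as this short check shows. Only the counts in millions come from the text; the variable names are illustrative.

```python
# Reported tweet counts by predicted sentiment, in millions.
counts_million = {"negative": 15.74, "neutral": 5.54, "positive": 4.02}

total_million = sum(counts_million.values())

# Share of each sentiment class, rounded to whole percentage points.
shares_pct = {label: round(100 * n / total_million)
              for label, n in counts_million.items()}
```

The three counts sum to about 25.3 million tweets, consistent with the "over 25 million" figure given for the dataset.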

Table 4 Descriptive statistics of prediction results on sentiments and targets.

Figure 1 shows the time series of the percentage of positive, negative, and neutral tweets. It suggests a robust pattern: the Chinese Twitter community was consistently negative toward China during the early pandemic. There are some nuances, however, as sentiments in Chinese tweets resonate with major events related to the COVID-19 pandemic and the US election. After the first cases were confirmed in Wuhan in mid-January 2020, the daily number of negative tweets first soared and then declined, but it increased again after former US President Trump tweeted the racial slur “Chinese Virus”. For instance, on March 16, 2020, Trump tweeted, “The United States will be powerfully supporting those industries, like Airlines and others, that are particularly affected by the Chinese Virus. We will be stronger than ever before!” It is noteworthy that the spike in the neutral trend in early November 2020 was due to attention shifting toward the voter fraud conspiracy in the 2020 US presidential election, while the spike in early February 2021 was due to discussion by both pro- and anti-China users of the World Health Organization (WHO)’s preliminary report on the origins of the coronavirus (i.e., that a leak from a Wuhan lab was unlikely). Note that the time series of unique Twitter users based on these positive, negative, and neutral tweets shows a similar pattern, and there were consistently more active users tweeting negative posts over time (see Fig. S1 in SI).

Fig. 1: Daily trends of positive, negative, and neutral sentiment toward China on Twitter.

The sentiment toward China in Chinese language communities was predominantly negative during the early pandemic. This figure is covered by the Creative Commons Attribution 4.0 International License. Reproduced with permission of the Authors; copyright © the Authors, all rights reserved.

Keyword analysis shows that China and the CCP were more likely to be mentioned than people of Asian or Chinese descent. Figure 2 shows the daily trends of tweets mentioning China, the CCP, and people of Asian or Chinese descent (亚裔/华裔). It clearly shows that Chinese language Twitter users mentioned China and the CCP more often than people of Asian or Chinese descent (see Fig. S2 in SI for the user-level analysis). Figure 2 also shows that China/CCP keywords surged during the early pandemic, peaked after former US President Donald Trump tweeted “Chinese Virus” on March 16, 2020, and then remained relatively steady. Asian-related keywords show a similar pattern during the early pandemic, but they also surged after March 2021 because of the tragic Atlanta spa shootings. Mentions of people of Asian or Chinese descent were likely associated with the StopAsianHate movement.

Fig. 2: Daily trends of Chinese Tweets mentioning China, Asians/Chinese, and Chinese Communist Party (CCP).

Keyword analysis of Chinese tweets shows a consistent pattern that Chinese language users mentioned China and CCP more compared to People of Asian or Chinese descent.

Overall, the sentiment toward China was negative. But who were they targeting? To address our third research question about the main targets of positive and negative sentiments, our sentiment target analysis shows that most negative tweets were targeting the Chinese government or China in general instead of the Chinese people. Figure 3 shows the daily trends of tweets targeting different China-related entities. The majority of sentiments in the CNTweets database were directed toward the Chinese government. During the early pandemic, around 60% of tweets were targeting the Chinese government, around 11% were targeting Chinese as an ethnic group, and around 25% were targeting China in general. Similarly, if we focus on active Twitter users instead of tweets, we find a similar pattern, as there were consistently more active Twitter users targeting the Chinese government during the early pandemic (see Fig. S3 in SI).

Fig. 3: Daily trends of Chinese Tweets targeting different entities.

Chinese language communities on Twitter were consistently targeting the Chinese government and China in general instead of the Chinese people.

For those tweets with negative sentiments, as shown in Table 5, 80% were targeting the Chinese government, 11% were targeting the Chinese people, and 19% were targeting China in general. For those tweets with positive sentiments, the proportions associated with the Chinese government, the Chinese people, and broad China were 20%, 34%, and 46%, respectively. These results suggest that negative tweets were more likely to target the Chinese government and positive tweets were more likely to support China in general.

Table 5 Proportion of target entities by different sentiments.
Table 6 Proportion of main content.

A network analysis of pro- and anti-China Twitter users

To address the fourth research question on the engagement between pro- and anti-China Twitter users, we use the results from the sentiment analysis to classify Twitter users into pro-China and anti-China users based on each user’s rate of positive tweets. If a user’s positive rate is >0.6, we label the user as pro-China; if it is <0.4, we label the user as anti-China. This yields 459,821 anti-China users and 496,504 pro-China users.
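The thresholding rule can be sketched as below. The denominator of the positive rate is an assumption (here, all of a user's classified tweets), since the text does not state whether neutral tweets are counted; the label for users between the two thresholds is also a placeholder.

```python
def classify_user(n_positive: int, n_total: int,
                  upper: float = 0.6, lower: float = 0.4) -> str:
    """Label a user from the share of their tweets classified as positive.

    `n_total` is assumed to be the number of the user's classified tweets.
    """
    if n_total <= 0:
        raise ValueError("user has no classified tweets")
    rate = n_positive / n_total
    if rate > upper:
        return "pro-China"
    if rate < lower:
        return "anti-China"
    # Users between the two thresholds are left out of both groups.
    return "unclassified"
```

For example, `classify_user(7, 10)` returns `"pro-China"` and `classify_user(3, 10)` returns `"anti-China"`, while a user with exactly half positive tweets falls into neither group.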

We then constructed a conversation network for the pro- and anti-China users in our database based on whether they engaged in the same conversations, using Twitter’s conversation_id. Twitter assigns the same conversation_id to all tweets in a conversation thread. Typically, the conversation_id is identical to the id of the tweet posted by the first user, and all replies to this post, as well as replies to those replies, share the same conversation_id. For the identified pro- or anti-China users, we observed 19.82 million unique conversations in our database. We also find that 96.4% of these tweets contained no replies or engagement with others. Among these tweets with no replies, pro-China users contributed 1.78 million while anti-China users contributed 17.32 million. Notably, 0.83 million conversations had at least one pro-China user and one anti-China user. Thus, conversations between pro- and anti-China users only accounted for 4.7% of total conversations that occurred among identified pro- or anti-China users in our database. To further quantify the level of segregation between pro- and anti-China users, we computed the E-I index, a measure of homophily in social networks that captures the difference between between-group and within-group ties (Krackhardt and Stern, 1988; Bojanowski and Corten, 2014). The E-I index takes the value +1 if all ties are between groups and −1 if all ties are within groups. For more technical definitions, see Section 3 in SI. The E-I index based on our conversation network was −0.33. This shows that pro- and anti-China users were more likely to engage within their own groups and lacked in-depth cross-boundary engagement with each other. We also conducted an additional retweet network analysis (see Section 5 in SI), and the results are similar. The E-I index for the retweet network was −0.906, suggesting an even more segregated pattern between pro- and anti-China users in retweeting behavior.
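The E-I index of Krackhardt and Stern (1988) is (E − I) / (E + I), where E is the number of ties between groups and I the number of ties within groups. The sketch below computes it for a toy network; the users, groups, and the choice to count each tie once without weights are illustrative assumptions.

```python
from typing import Dict, Sequence, Tuple


def ei_index(edges: Sequence[Tuple[str, str]], group: Dict[str, str]) -> float:
    """E-I index: (external - internal) / (external + internal) ties."""
    if not edges:
        raise ValueError("the network has no ties")
    external = sum(1 for u, v in edges if group[u] != group[v])
    internal = len(edges) - external
    return (external - internal) / (external + internal)


# Toy network: two pro-China and three anti-China users.
group = {"p1": "pro", "p2": "pro", "a1": "anti", "a2": "anti", "a3": "anti"}
edges = [
    ("p1", "p2"),                              # within the pro group
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),  # within the anti group
    ("p1", "a1"), ("p2", "a3"),                # across groups
]
toy_ei = ei_index(edges, group)
```

With four within-group ties and two between-group ties, the toy index is (2 − 4) / 6, a negative value that indicates a tendency to interact within one's own group.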

Figure 4 visualizes the conversation network among Twitter users. For ease of illustration, we only plot Twitter users with at least 10 conversations. Red dots indicate pro-China users, while blue dots denote anti-China users. The figure shows a polarized pattern: pro-China and anti-China users clustered into their own groups, although the two sides did engage in some dialogues that might support or criticize China. Some nodes in Fig. 4 attracted attacks from the other side. Among the 219,985 conversations with at least one pro-China and one anti-China user, we find that 23% had only one pro-China and one anti-China participant, and the majority (74%) had <10 pro- or anti-China users. Only 26% of these conversations involved over 10 pro-China or anti-China users. Taken together, these findings suggest a polarized pattern in which pro- and anti-China users lacked in-depth engagement in China-related conversations. Given that we focus solely on Chinese tweets centering on China, our results cannot be extended to other conversations not pertaining to China, Chinese, or CCP.

Fig. 4: Conversation network visualization of pro- and anti-China users.

Nodes are pro- (Red) and anti- (Blue) China Twitter users and edges indicate at least 10 conversations between two nodes. Pro- and anti-China users were more likely to be segregated from each other, even though there was some cross-boundary engagement. This figure is covered by the Creative Commons Attribution 4.0 International License. Reproduced with permission of the Authors; copyright © the Authors, all rights reserved.

The content of Chinese Tweets

Next, we move to the fifth research question on the content of these Chinese tweets. Overall, the majority of tweets were related to politics, followed by democracy and freedom, US issues, and COVID-19 topics. As mentioned earlier, we trained a series of RoBERTa classifiers to identify potential topics in these tweets. As shown in Table 6, our RoBERTa topic classifiers show that 73% of tweets were broadly related to politics. Politics could be any topic related to ideology, democracy, policy, major figures in China or other countries, geopolitics, etc. More specifically, 31% were associated with discussions on democracy and freedom, 22% discussed US politics such as domestic issues and elections, 9% discussed Hong Kong protest issues, and 6% mentioned Taiwan politics. In addition, 27% of these tweets were related to US topics. Note that 14% were related to US-China relations, which is plausible given the trade war between China and the United States. Another 20% of tweets discussed COVID-19 issues, while culture, economy, and religion-related topics only accounted for 6%, 5%, and 2%, respectively. Note that we define economy topics as any economic issues such as infrastructure investment, economic growth, and the development of industrialization and modernization. Culture-related topics include travel, food, sports, art, entertainment, etc. The religion topic covers religious freedom or other religious issues.
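The topic shares reported above add up to more than 100%, which is what one would expect if each topic has its own binary classifier and a tweet can be flagged for several topics at once. The sketch below illustrates this under that assumption; the function name and the toy predictions are hypothetical and do not reproduce the paper’s classifiers.

```python
def topic_shares(predictions):
    """Share of tweets flagged for each topic.

    predictions: one dict per tweet mapping topic name -> bool, as would be
    produced by running one binary classifier per topic (an assumption).
    """
    n_tweets = len(predictions)
    if n_tweets == 0:
        return {}
    counts = {}
    for labels in predictions:
        for topic, flagged in labels.items():
            counts[topic] = counts.get(topic, 0) + int(bool(flagged))
    return {topic: count / n_tweets for topic, count in counts.items()}


# Hypothetical toy predictions for four tweets.
toy_predictions = [
    {"politics": True, "covid": True, "culture": False},
    {"politics": True, "covid": False, "culture": False},
    {"politics": True, "covid": False, "culture": False},
    {"politics": False, "covid": True, "culture": False},
]
toy_shares = topic_shares(toy_predictions)
```

Because the first toy tweet is flagged for both politics and COVID-19, the shares sum to more than one, mirroring the overlap between categories in the reported percentages.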

The keyword analysis shows that COVID-related keywords quickly peaked in Chinese language communities after the outbreak, but US and Hong Kong-related topics prevailed during the early pandemic. Figure 5 shows the daily trend of some keywords of interest, including COVID-19, Taiwan, the US, Hong Kong, Tibet, and Xinjiang. These were the major issues targeted by anti-China Twitter users. Unsurprisingly, COVID-related Chinese keywords increased rapidly in the Twitter community after the outbreak, peaked after March, and declined after April 2020. However, the US and Hong Kong-related topics were often discussed in the community as the US-China trade war and Hong Kong protests were dominating the issue attention cycle during the early pandemic, followed by Taiwan, Xinjiang, and Tibet issues.

Fig. 5: Daily trends of Chinese Tweets mentioning Hong Kong, Taiwan, Xinjiang, Tibet, the US, and COVID-19.

Keyword analysis shows that COVID-19, Hong Kong, and the US were frequently mentioned by Twitter users in Chinese language communities during the early pandemic.
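A daily keyword trend of the kind plotted in Fig. 5 can be sketched as a count of tweets per day that contain at least one term from each keyword list. The keyword lists below are hypothetical placeholders, not the lists used in the paper, and the input layout (pairs of date and tweet text) is also an assumption.

```python
from collections import Counter
from datetime import date

# Hypothetical keyword lists for illustration only.
KEYWORDS = {
    "COVID-19": ["新冠", "肺炎", "疫情"],
    "Hong Kong": ["香港"],
    "Taiwan": ["台湾", "臺灣"],
}


def daily_keyword_counts(tweets, keywords=KEYWORDS):
    """Count tweets per (day, topic) that mention any of the topic's terms.

    tweets: iterable of (date, text) pairs. A tweet is counted at most once
    per topic per day, but may count toward several topics.
    """
    counts = Counter()
    for day, text in tweets:
        for topic, terms in keywords.items():
            if any(term in text for term in terms):
                counts[(day, topic)] += 1
    return counts


# Hypothetical toy tweets.
toy_tweets = [
    (date(2020, 1, 23), "武汉疫情严重"),
    (date(2020, 1, 23), "香港新闻"),
    (date(2020, 1, 24), "新冠肺炎 香港"),
]
toy_counts = daily_keyword_counts(toy_tweets)
```

Substring matching is used here because Chinese text is not whitespace-delimited; a production analysis might need word segmentation to avoid false matches.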

Structural topic modeling also shows that the most popular themes in CNTweets were China’s domestic politics, COVID-19, US politics, and Hong Kong and Taiwan issues. Figure 6 plots the distribution of themes extracted from our CNTweets data. We estimated 30 topics using the structural topic model. Results suggest that democracy-freedom (8%), US election (6.9%), global issues (6%), the 50-cent party (i.e., supporting CCP, 5.4%), culture-education (5.1%), COVID-19 (4.9%), Hong Kong-National Security Law (4.8%), Wuhan outbreak (4.8%), human rights (e.g., Xinjiang, 3.7%), and the US-China Initiative (3.6%) were the top 10 themes on Twitter during the early pandemic. Other prevalent topics include COVID origin (made in a Wuhan lab), the Huawei ban, Chinese policing, the Chinese economy, anti-CCP, etc.

Fig. 6: Structural topic model output, K = 30.

The most popular themes from CNTweets were related to freedom and democracy, the US election, geopolitics, and COVID-19.

The dynamics between topics and sentiments

To address the sixth question on the relationship between topics and sentiments, we ran a logistic regression model using topics to predict whether a tweet’s sentiment is positive towards China. Table 7 reports logit coefficients from the model focusing on different topics. Model 1 explores which topics among COVID-19, politics, religion, culture, economy, and US-China relations were more likely to be positive towards China. We show that a tweet is less likely to be positive towards China when it pertains to COVID-19, politics, religion, or US-China relations but more likely to be positive if it relates to cultural or economic issues. We further plot the daily trends of the fraction of negative tweets by topic in Fig. 7. The figure suggests that the negativity towards China among Chinese language communities during the early pandemic was mostly driven by the discussion on politics, followed by COVID-19 and US-China relations. After the outbreak of the coronavirus, the percentage of negative COVID-19 tweets in our database increased rapidly but then declined and was surpassed by the negative tweets on US-China relations after June 2020.

Table 7 Logistic regression results predicting positive sentiment at the Tweet level.

Fig. 7: Daily trends of the proportion of negative Tweets by topic.

The negativity in the database was driven by the surging negative COVID-19 and political topics.
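The tweet-level logistic regression described in this section can be sketched with binary topic indicators as predictors and positive sentiment as the outcome. The code below is a minimal pure-Python fit by batch gradient ascent on hypothetical toy data; it is not the authors’ specification or software, and the learning rate and epoch count are illustrative choices.

```python
import math


def fit_logit(X, y, lr=0.5, epochs=5000):
    """Fit a logistic regression by batch gradient ascent.

    X: list of rows of 0/1 topic indicators; y: list of 0/1 outcomes
    (1 = positive towards China). Returns [intercept, coef_1, ..., coef_k].
    """
    n_rows, n_features = len(X), len(X[0])
    w = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for row, target in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], row))
            p = 1.0 / (1.0 + math.exp(-z))
            err = target - p  # gradient of the log-likelihood w.r.t. z
            grad[0] += err
            for j, xj in enumerate(row):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / n_rows for wj, gj in zip(w, grad)]
    return w


# Hypothetical toy data with columns [politics, culture]:
# politics tweets are 25% positive, culture tweets 75%, other tweets 50%.
toy_X = [[1, 0]] * 8 + [[0, 1]] * 8 + [[0, 0]] * 8
toy_y = [1, 1, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 1, 1, 0, 0] + [1, 1, 1, 1, 0, 0, 0, 0]
toy_intercept, toy_politics, toy_culture = fit_logit(toy_X, toy_y)
```

In the toy data the politics coefficient is negative and the culture coefficient is positive, the same sign pattern the text reports for those two topics.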

The topic variation between pro- and anti-China users

To examine the final research question about whether different types of users engaged in distinct topics, we ran an additional analysis to compare topic proportions between the 459,821 anti-China users and the 496,504 pro-China users. Table 8 reports the average number of tweets and the overall proportion for each topic among all tweets posted by these pro- or anti-China users. Both sides were heavily engaged in topics including politics, the US, and COVID-19 issues. Over 30% of pro- or anti-China users’ tweets involved some aspect of politics.

Table 8 Average Tweets by pro- and anti-China users.

Relative to politics, pro-China users were more likely than anti-China users to tweet about economy, culture, COVID-19, and US issues. As shown in Table 8, the average pro-China user in our CNTweets database posted fewer tweets than the average anti-China user. For instance, on average, a pro-China user had 2.88 tweets discussing politics, while an anti-China user had 35.07 tweets. But in terms of topic shares across all tweets made by these users, pro-China users focused more on economy, culture, COVID-19, and US issues, while anti-China users focused more on politics, particularly democracy and freedom and Hong Kong politics. The variation in topics reflects the different agendas of these pro- and anti-China users on Twitter.
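The two quantities contrasted in this comparison, average tweets per user and a topic’s share of a group’s tweets, can diverge, as the sketch below illustrates. The function name, input layout, and toy data are hypothetical, and the averaging denominator (users with at least one tweet in the data) is an assumption rather than the paper’s stated definition.

```python
from collections import defaultdict


def topic_summary(tweets, group):
    """Per-group topic statistics.

    tweets: iterable of (user_id, set_of_topics); group: {user_id: label}.
    Returns {label: {topic: (avg_tweets_per_user, share_of_group_tweets)}}.
    """
    users = defaultdict(set)
    total = defaultdict(int)
    by_topic = defaultdict(lambda: defaultdict(int))
    for user_id, topics in tweets:
        label = group.get(user_id)
        if label is None:
            continue  # skip users that belong to neither group
        users[label].add(user_id)
        total[label] += 1
        for topic in topics:
            by_topic[label][topic] += 1

    return {
        label: {
            topic: (count / len(users[label]), count / total[label])
            for topic, count in by_topic[label].items()
        }
        for label in users
    }


# Hypothetical toy data: anti users post more overall, mostly on politics.
toy_group = {"a1": "anti", "a2": "anti", "p1": "pro"}
toy_tweets = (
    [("a1", {"politics"})] * 4
    + [("a2", {"politics"})] * 2
    + [("a2", {"culture"})]
    + [("p1", {"politics"}), ("p1", {"culture"})]
)
toy_summary = topic_summary(toy_tweets, toy_group)
```

In the toy data the anti group has more politics tweets per user, while culture takes up a larger share of the pro group’s tweets, which is the kind of contrast the paragraph describes.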

Discussion and conclusion

This paper used multi-modal supervised and unsupervised machine learning tools to examine anti-China sentiments and topics in Chinese language communities on Twitter during the early COVID-19 pandemic. Since the outbreak, scholars have documented a global surge of anti-China sentiments. Our work is the first to systematically examine the dynamics of sentiments in Chinese language communities on a major Western social media platform. Compared to Chinese media platforms like Sina Weibo, Twitter is a public space that attracts users who intend to express their criticism or support toward China. Thus, it affords scholars a window to examine the relationship between pro- and anti-China users online. Readers should note, however, that Chinese language communities on Twitter are not representative of Chinese communities or the Chinese diaspora.

Based on the analysis of over 25 million Chinese tweets from December 2019 to April 2021, we find that the majority of these China-related tweets were generated by only 1% of Twitter users. These Chinese language users, who reported locations in Mainland China, the US, Hong Kong, and Taiwan, were more likely to mention China or CCP than people of Asian or Chinese descent. The majority of these tweets were personal opinion-oriented, followed by news-like content and government or institutional announcements. These results suggest that the users producing China-related tweets in Chinese language communities might be a very selective group, as a handful of Twitter users contributed the majority of content on China topics.

We also find that the majority of tweets in our CNTweets database were negative toward China, although these sentiments were more likely to target the Chinese government or China in general than the Chinese people. Pro- and anti-China Twitter users were predominantly segregated: they were more likely to engage in conversations on their own side, and only a small number of Twitter users engaged in conversations with the other side. Note that pro-China users contributed 1.78 million tweets with no replies from others, while anti-China users contributed 17.32 million in our database. These results suggest that Twitter has been used as a major platform by anti-China users to criticize the Chinese government and CCP. Prior research found a lack of evidence of computational propaganda by CCP but strong evidence of computational propaganda by anti-China groups on Twitter due to low operating costs (Bolsover and Howard, 2019). Our results show that anti-China users were indeed more active on Twitter than pro-China users during the early pandemic, but given the large volume of tweets from both sides, our work adds evidence for the existence of potential computational propaganda by both pro- and anti-China users (see Fig. S5 in SI). Since we focus solely on Chinese tweets, we cannot extend this conclusion to the entire Twitterverse, as CCP might be more likely to target English language communities than Chinese language communities.

The most common topics discussed by these anti-China Twitter users were political, such as democracy and freedom, Hong Kong protests, Taiwan politics, Xinjiang, and Tibet issues. Even though both pro- and anti-China users were heavily engaged in discussions of politics, pro-China users were more likely to discuss topics related to economy, COVID-19, US issues, and culture, while anti-China users were more likely to focus on democracy, freedom, and Hong Kong politics. Our regression analysis shows that tweets related to culture and economy were more likely to be positive towards China, while tweets associated with COVID-19, politics, religion, and US-China relations were less likely to be positive towards China. These findings are consistent with the view that pro-democracy activists tend to take advantage of social media platforms to promote democracy and criticize the Chinese government, while pro-China Twitter users tend to use economy and culture topics to boost China’s international image.

Taken together, our findings show that Sinophobia was ubiquitous among the Chinese language communities on Twitter during the early pandemic, and that the Twitterverse is a battlefield that attracts both pro- and anti-China users pursuing their own potential propaganda agendas. Previous studies often focus on English language communities on social media platforms and overlook non-English communities. The potential propaganda by both parties targeting Chinese ethnic groups might have negative consequences in the community. Many social media platforms have developed policies and tools to mitigate these negative consequences, such as blocking hateful terms and suspending controversial accounts, but very few resources have been devoted to minority communities. Recent research has shown that the COVID-19 crisis increased censorship circumvention and access to international news and political content blocked in China, but when individuals sought crisis-related information, they were also exposed to misinformation and anti-China racism in their own online language communities.

Readers should note that our research has some limitations. First, some classifiers have a relatively low F1 score (e.g., culture, religion, and economy). One future direction is to use semi-supervised machine learning methods to improve predictive power by adding more positive cases. In addition, our location analyses relied on locations self-reported by Twitter users rather than their actual geo-locations. Finally, we only collected tweets that matched our keywords during the early pandemic rather than the whole Twitterverse. We leave these to future research.