On 9 November 2020, Pfizer and BioNTech announced an interim vaccine efficacy of >90% for their mRNA vaccine candidate BNT162b2 (Pfizer and BioNTech, 2020). This was followed by Moderna’s announcement of their mRNA candidate mRNA-1273 with an efficacy of 94.5% on 16 November 2020 (Moderna, 2020). Interim results from AstraZeneca-Oxford of their viral-vectored ChAdOx1 vaccine were released on 8 December 2020 with an overall vaccine efficacy of 70.4% across two cohorts (Voysey et al., 2021). Pfizer then announced an overall vaccine efficacy of 95% and of 94% in the high-risk group of 65–85 year olds on 10 December 2020 (Polack et al., 2020). These fundamental basic scientific discoveries are expected to pave the way for the successful eradication of the coronavirus disease 2019 (COVID-19). Sputnik V and CanSino vaccines had been used in Russia and China, respectively (Sputinik, 2020; Wee and Qin, 2020). Both the Pfizer-BioNTech and Moderna vaccines were approved for emergency use in the USA. The Pfizer-BioNTech, Moderna, and AstraZeneca-Oxford were authorised for use in the UK. Several other countries approved these vaccines in December 2020 and January 2021 (Carvalho et al., 2021). At the time of writing, 32.5% of the UK population and 38.9% of the US population were fully vaccinated. The data also show that 8.6% of Brazilians and 3.0% of Indians have been vaccinated against COVID-19 (Our World in Data, 2021).

With vaccination programs underway in many countries, it becomes important to examine people’s (un)willingness to get vaccinated against COVID-19. Before the COVID-19 vaccine becomes available, a number of survey studies have provided percentages of the populations willing to receive a COVID-19 vaccine, 69% of Americans (Reiter et al., 2020), 69% of British, and 64.9% of Irish (Murphy et al., 2021), 73.9% of Europeans (NeumannBöhme et al., 2020). Research findings also revealed reasons behind vaccine hesitancy and refusal, including concerns about vaccine safety, potential adverse events and side-effects, lack of knowledge of vaccines, lack of trust in pharmaceutical companies (Fisher et al., 2020; Murphy et al., 2021; NeumannBöhme et al., 2020; Reiter et al., 2020). Besides these survey studies, researchers explored social media data (e.g., Twitter tweets, YouTube videos, and Facebook posts) and categorised vaccine sentiments expressed by online users. One study of tweets conveyed that vaccine hesitancy stemmed from the following themes: concerns over safety, suspicion about political or economic forces driving the COVID-19 pandemic, or vaccine development (Griffith et al., 2021). Another study of social media posts categorised 58% of these posts into positive sentiments. Public optimism over vaccine development, effectiveness, and trials were identified (Hussain et al., 2021). As alluded to above, although researchers and regulators stress that the benefits of vaccines outweigh the risks, there is still low vaccine uptake in states and countries. Because of limited socialising during the pandemic, concentrated vaccine information exploded in social media. We thus are motivated to explore vaccine hesitancy factors that are impeding the efforts by public health officials to fight the pandemic. Despite a panoply of previous study findings revealing public perceptions and sentiments toward COVID-19 vaccines from social media data, little attention has been paid to public understandings of vaccine efficacy and their behavioural changes via individual-level social media data over the period of vaccine development and vaccination programs.

Against this background, the objective of this study is to identify and categorise themes of YouTube audience’s perceptions of COVID-19 vaccines using big data analytics. We focus on studying real-time data gleaned from YouTube to understand public thoughts, feelings, and behaviours. A thematic analysis of YouTube comments is conducted to identify the YouTube audience’s perceptions. With identified thematic clusters, we then adapt these clusters as constructs to the Health Belief Model (HBM) (Rosenstock et al., 1988). With this predictive model, we undertake a causal modelling approach to explore the correlations and assess the causal relations amongst identified themes. Specifically, the following are investigated: (1) what are the major themes of YouTube audience’s discussion of COVID-19 vaccines? (2) how do identified thematic clusters influence the YouTube audience’s intention to get vaccinated against COVID-19?

This study provides two key contributions. First, it provides a contribution in terms of research methods that can be implemented by applying text analytics to generate weights from social media data for predictive modelling. We believe it advances the method of big data analytics and generates insights for data analytics researchers. Second, this study adds to previous studies by mapping and prioritising various reasons behind vaccine hesitancy and refusal that can be utilised to devise further vaccine intervention programs. We extended the HBM model by examining existing constructs and newly emerged factors that influence COVID-19 vaccination behaviours. Such insights have practical value for public health professionals and the way they combat the COVID-19 pandemic.

Literature review

We first cite the definition of vaccine hesitancy and three vaccine hesitancy categories proposed by the World Health Organization (WHO). Specifically, we review the COVID-19 vaccine hesitancy factors identified from several survey studies across the world. We continue to review the literature on the COVID-19 vaccine hesitancy in the social media context. The majority of the reviewed papers focus on sentiment analysis and topics of public discourses. After analysing previous studies, we identify the research gap and propose our research objectives. The comparison of traditional surveys and big data analytics is discussed. We emphasise the relevance and meaningfulness of applying big data analytics to the study of the COVID-19 vaccine hesitancy.

Vaccine hesitancy in the COVID-19 pandemic

The WHO Strategic Advisory Group of Experts (SAGE) Working Group defines vaccine hesitancy as a delay in acceptance or refusal of vaccination despite the availability of vaccination services. It is complex, context-specific, varying across time, place, and vaccines (MacDonald and the SAGE Working Group on Vaccine Hesitancy, 2015). Dubé and collaborators (2013) have suggested that vaccine hesitancy should be seen on a continuum ranging from active demand for vaccines to complete refusal of all vaccines. As a heterogeneous group, vaccine-hesitant individuals are in the middle of this continuum. These individuals may agree to take some vaccines and refuse others. They may acquiesce in vaccination but be unsure about specific vaccine uptake (Dubé et al., 2013). The Working Group proposes three categories of factors that affect individuals’ decision to be vaccinated. These factors include confidence (e.g., trust in vaccine effectiveness and safety, trust in the system and health professionals, and policy-makers motivations), complacency (e.g., perceived risks of vaccine-preventable disease), and convenience (e.g., physical availability, affordability, and willingness to pay, and geographical accessibility). The WHO SAGE Working Group developed the vaccine hesitancy determinants matrix with traditional vaccine hesitancy factors, such as personal experience with vaccination, perceived risks/benefits, geographic barriers, religion, cost, and so on (MacDonald and the SAGE Working Group on Vaccine Hesitancy, 2015). In the context of the ongoing COVID-19 pandemic, vaccine hesitancy factors related to the COVID-19 vaccine may be similar but conceptually distinct from traditional hesitancy factors in the circumstances that vaccine safety is not well-established and not widely available. Callaghan and collaborators (2020) cautioned that reasons for vaccine hesitancy and refusal in previous studies might not be generalised and used for the COVID-19 vaccine studies (Callaghan et al., 2020). It is estimated that the threshold for the population to achieve COVID-19 herd immunity through vaccination or prior infection would be 55% (R0 = 2.2) and 82% (R0 = 5.7) (Sanche et al., 2020). A high vaccine hesitancy rate would significantly affect the attainment of immunity. Consequently, it is crucial to explore and identify the public acceptance of this particular vaccine and their reasons for not pursuing vaccination. Previous studies suggested that COVID-19 vaccination intentions vary substantially across countries (Edwards et al., 2020). Early data from a survey of the United States showed that 31% of participants were unwilling to get a COVID-19 vaccine with a reportedly high level of perceived potential vaccine harm (Reiter et al., 2020). Additional reasons for vaccine hesitancy included vaccine-specific concerns, a need for more information, anti-vaccine attitudes or beliefs, and a lack of trust (Fisher et al., 2020). A nationally representative survey study from the UK reported that 26% of respondents were hesitant about a COVID-19 vaccine (Murphy et al., 2021). Another survey study in seven European countries (i.e., Denmark, France, Germany, Italy, Portugal, the Netherlands, and the UK) found that 18.9% of participants stated they were unsure, and 7.2% said they would not get vaccinated. Reasons for hesitating to get a COVID-19 vaccine included concerns about potential side-effects and vaccine safety (NeumannBöhme et al., 2020). Taken together, vaccine hesitancy and refusal can be major barriers to attaining herd immunity.

COVID-19 vaccine hesitancy and social media

Social media provides users with non-traditional sources for health information, including blogs and alternative news sites. It offers an outlet to share vaccine opposition beliefs widely in real-time (Deiner et al., 2019). Vast and varied discussions on vaccinations are going on in social media even before the COVID-19 vaccine becomes available. To date, a variety of studies have shown the impact of social media information on public vaccination decision-making. Johnson et al. (2020) provided a system-level analysis of the online ecology of Facebook users’ vaccine views. The illustration of the evolution of online ecology showed that anti-vaccination clusters became highly entangled with undecided clusters in the online network. Pro-vaccination clusters were more peripheral in the illustration. The study’s theoretical framework predicted that anti-vaccination views would dominate in a decade (Johnson et al., 2020). More recently, Loomba and collaborators (2021) conducted a randomised control trial in the UK and USA to measure the impact of the COVID-19 vaccine misinformation on vaccination intent. The findings indicated that misinformation had induced declines in intent in the UK and USA, with points of 6.2% and 6.4%, respectively. Additionally, scientific-sounding misinformation was associated with reductions in vaccination intents (Loomba et al., 2021). The growing interest in studying the impacts of social media information motivates us to conduct a literature review. Since the review focuses on social media and COVID-19 vaccines, we searched key terms including ‘social media,’ ‘COVID vaccine,’ ‘Twitter,’ ‘Facebook,’ ‘YouTube,’ ‘Reddit,’ ‘TikTok,’ and ‘Instagram,’ and a combination of terms in Google Scholar. We limited our search to studies in English. We screened the titles, abstracts, and full texts to evaluate the suitability against inclusion and exclusion criteria. Disagreements were resolved through discussion, and we reached a consensus. Reasons for exclusion of studies were: not addressing COVID-19 vaccines, contents related to specific vaccines such as influenza, measles, and Human Papillomavirus (HPV), contents related to vaccines in general, not peer-reviewed, and editorials, letters, and commentaries. The resulting articles were coded with reference details, social media platforms, methods, sample size, and main results (see Supplementary Appendix 1). Among fifty-two articles, we identified nine relevant articles related to COVID-19 vaccines in the social media context. Most of the studies applied content analysis of social media data. For instance, Basch and collaborators (2021a) documented the numbers of YouTube videos, numbers of views, and proportion of cumulative views (Basch et al., 2021a). They extended the work to study TikTok videos and explore sentiments of encouraged and discouraged use of COVID-19 vaccines (Basch et al., 2021b). Three studies focused on categorising topics themes, including grouping most reliable, less reliable, and unreliable Twitter information (Jamison et al., 2020), grouping Twitter vaccine hesitancy themes (e.g., safety and political scepticism) (Griffith et al., 2021), and characterising major vaccine hesitancy topics (e.g., conspiracy, developing speed, and safety) (Thelwall et al., 2021). The rest of included articles employed machine learning to investigate sentiments of online data regarding COVID-19 vaccines. One study manually annotated tweets collected after the first vaccine announcement and concluded that most tweets had a neutral stance (Cotfas et al., 2021). Another study analysed the sentiment trends in the UK and US. The geospatial mapping of social media public sentiment helped identify areas with more negative sentiments toward COVID-19 vaccines. The comparative analysis revealed that average Facebook and Twitter comments were mainly positive in both countries (Hussain et al., 2021). Praveen et al. (2021) conducted sentiment analysis and topic modelling to identify Indian citizens’ attitudes and concerns about COVID-19 vaccines (Praveen et al., 2021). Despite a number of studies have identified public sentiments and topics of public discourses in major social media platforms, little attention has been paid to explore the impacts of identified topics/factors on social media users’ vaccination intention. In this study, we intend to identify public discourse topics themes regarding COVID-19 vaccines. We are motivated to take a further step to explore the relationships between vaccine hesitancy factors and the public’s intention to get vaccinated against COVID-19.

Survey and big data analytics

As we observed the abovementioned acceptance rates of COVID-19 vaccinations across countries, most studies were conducted using surveys. These traditional methods are characterised by slow and costly research, small sample size, and low response rates (On et al., 2019). Traditional survey methods offer useful evidence, both cross-sectional and regional but lack individual-level data. Furthermore, it is difficult to track changes in nuanced beliefs among different populations over time (Dredze et al., 2016). For instance, while using surveys for exploring public concerns about vaccinations, researchers found that the methods were limited in the ability to detect sudden changes in confidence levels (Karafillakis et al., 2017). Yet, it is critical to monitor and analyse changes in the perceptions of vaccinations continuously. In contrast to traditional survey methods, social media analytics allows researchers to observe many individuals’ beliefs, especially in the pandemic, without reaching the people directly and physically. Specifically, social media analytics provides unprecedented, real-time access to people’s perceptions, attitudes, and behaviours (Dredze et al., 2016). In the case of vaccinations, social media becomes a hotbed for anti-vaccine activities. Applying social media analytics can fill in gaps left by traditional methods on vaccine hesitancy and refusal (Dredze et al., 2016). For example, Dredze et al. (2016) noted that researchers could better capture the opinions of vaccine refusals and investigate public messages shared by members of a community of interest (Dredze et al., 2016). Moreover, by directly observing online messages, researchers can circumvent bias concerns of participants who respond in ways they are expected during survey data collection. In the case of sensitive and controversial topics, such as vaccine hesitancy and refusal, social media data may provide more accurate information (Dredze et al., 2016). Du et al. (2020) demonstrated the feasibility of deep learning of social media posts in understanding population-level and individual health beliefs and attitudes toward vaccines (Du et al., 2020). Larson (2020) pointed to the need for the use of social media in listening to public concerns in real-time, given the changing nature of vaccine sentiments (Larson, 2020).

In the era of big data, social media data is considered raw and organic materials that need to be purposefully processed, aggregated, and transformed into insights that provide meanings given a specific context (Hallikainen et al., 2020). Insights are attained through analytics, including extracting and interpreting hidden patterns of online public discourses. Previous studies have used social media data for opinion mining, sentiment analysis, and topic modelling to understand health-related issues (Griffith et al., 2021; Hussain et al., 2021). Few studies documented the numbers of YouTube videos, numbers of views, and proportion of cumulative views (Basch et al., 2020). Big data analytics, used in this study, thus refers to acquiring, storing, processing, and analysing a large collection of complex COVID-19 vaccine-related data. Rather than cataloguing YouTube videos, we intend to use the text mining method to dissect YouTube comments, discover information, and advance analyses of the collections of textual data. We use YouTube video viewers’ comments to understand their thoughts, feelings, and behaviours. We aim to create meaningful and real-time information to support public health professionals’ decision-making for the benefits of increased population vaccination to reduce susceptibility to COVID-19.


We conducted mixed-method studies to explore and identify social media discussion patterns and themes and investigate the relationships between identified themes/constructs regarding COVID-19 vaccines. The mix-method approach included a thematic analysis of qualitative data and a causal modelling approach to explore and assess the causal relations using quantitative data derived from the text mining method. A thematic analysis of YouTube video comments with labelled clusters allows us to understand video viewers’ prioritised discussion topics in social media conversations. We utilised these clusters to perform multiple regression analyses to examine the relationships between vaccine hesitancy factors and intentions to get vaccinated against COVID-19. This approach also investigates the impacts of misinformation on social media users’ hesitancy to take vaccine jabs.

Data collection

After reviewing previous studies (Cotfas et al., 2021; Griffith et al., 2021), indicated by Google trends with the highest number of searches, that is, the first announcement of Pfizer-BioNTech vaccine candidate over 90% effective in the first interim efficacy analysis, we focused on the announcements of the COVID-19 vaccine efficacy. We also included the announcements of Moderna and AstraZeneca-Oxford COVID-19 vaccine efficacy in November 2020, which were eligible for our data collection. Thus, the keyword ‘COVID-19 vaccine efficacy’ was used to search on YouTube. As indicated by previous studies (Basch et al., 2020; Jamison et al., 2020), most YouTube videos and the most reliable topics of the COVID-19 vaccine were uploaded by news media. We, therefore, limited our selection of video uploaders to accredited mainstream news outlets. Search results were sorted by view count to obtain the most widely viewed YouTube videos in descending order at the time of data collection. Videos that comments function was turned off and comment number was less than 500 posts were excluded. The URLs of each video, video title, and the number of posts were documented in Supplementary Appendix 2. With the list of URLs, we used Botster (ex. Seobot) to scrape YouTube video comments. All collected data were stored in Excel spreadsheets. We removed user nicknames, user URLs, and dates to ensure anonymity and avoid potential bias from data interpretation. Both original comments and replies under original posts were kept for data analysis. We focused on studying YouTube users’ comments, and the criteria for the inclusion of user comments are twofold: i. contents are in English, ii. comments that focus on the vaccine instead of the virus itself. We independently screened all comments and removed hyperlinks and advertisements. Figure 1 illustrates the data collection process. A total of 43,203 data entries were returned after we went through the exclusion process. Similar to previous studies’ approaches (Basch et al., 2020; Du et al., 2020), this study does not involve human subjects. The data collected are publicly available, and social media user comments are posted on publicly available websites. This study was exempt from the institutional board review.

Fig. 1: The flowchart of data collection process.
figure 1

It illustrates the design of data collection. Keywords to search relevant videos and video exclusion criteria were established for selecting videos. Once we selected videos, we scraped video comments and excluded specific comments with reasons. A number of 43,203 comments are used for data analysis.

SAS text analytics

We performed text analytics using SAS Text Miner (9.4) with a corpus of large amounts of textual data (documents in SAS Text analytics). The datasets were added as input sources in the abovementioned software. There are four typical processes in text analytics, including input data, text parsing, text filter, and text cluster. The text parsing node in the software was added to extract, clean, and create a dictionary of words from documents (Chakraborty et al., 2013). Each document was divided into tokens (terms). These tokens were listed in the matrix of term-by-frequency, which was used to identify the most frequently occurring words and a number of documents where each word occurred. Meanwhile, the text parsing process detected and determined parts of speech and removed stop words (e.g., and, a, and also) (Chakraborty et al., 2013). Table 1 shows the excerpts of the text parsing results. For instance, some of the most frequently occurring words are ‘vaccine,’ ‘people,’ ‘take,’ and ‘virus’, which makes sense as these words are commonly used in discussing vaccine efficacy provided in the scope of this study.

Table 1 Excerpts of the text parsing results.

After the text parsing node, the text filter node was added to filter out terms that occurred the least number of times in the collection of documents. Similar terms were formed into groups using synonyms. In other words, grouped terms closely related to each other were characterised by separate clusters of related terms. Thus, terms within one group were similar, and terms between groups were dissimilar (Chakraborty et al., 2013). Term Frequency-Inverse Document Frequency (TF-IDF) algorithm in Supplementary material 1 was used to evaluate how relevant a word is to a document in a collection of documents. Table 2 provides the matrix of documents of terms dropped and kept in the process of text filter.

Table 2 Excerpts of the text filter results.

The text cluster node worked on the basis of identified terms and their frequency of occurrence in the corpus of documents and within each document. At this step, singular value decomposition (SVD) was used to transform the original weighted term-document-matrix into a dense but less dimensional representation (Chakraborty et al., 2013). In other words, the clustering results via SVD on the term frequency document were set to generate enough SVD dimensions (k) for further analysis. The greater the number of dimensions (k), the higher the resolution to the term frequency clustered. In this study, k was set to 50, which was the default setting in the SAS software. The k-dimensional subspace was then generated from the cluster analysis via SVD. The cluster frequency root mean square standard deviation shown in Table 3 indicates that the derived clusters via SVD were well manifested and optimal as the values are close to 0. Regarding the number of clusters, we made some trial-and-error in the properties setting of text cluster in an attempt to generate well-separated clusters. A 40-cluster solution using a hierarchical clustering algorithm was set in the text cluster node. The results showed a set of descriptive terms of each cluster in Table 3, which revealed to a certain extent of the theme of each cluster.

Table 3 Text cluster results.

Figure 2 presents the cluster hierarchy graph. The distance between the clusters depicts the relationships between these clusters. The further the clusters are from each other; the less likely the key themes between the clusters are associated. In short, the distance between clusters manifests the association amongst the key themes. The endpoints of hierarchical clusters are sets of subclusters (e.g., 19 Pfizer and 21 Trump in Fig. 2), where each subcluster is distinct from other subclusters. Regarding clusters, descriptive terms in each cluster (e.g., Cluster 20 in Table 3) are broadly similar to each other, evidenced by shared terms (Chakraborty et al., 2013). In this study, for instance, democrats won and right related to the term political ideologies. The cluster perceived trust in pharma was mainly associated with companies and billion.

Fig. 2: Cluster hierarchy graph.
figure 2

It presents cluster hierarchy in the study. Clusters are represented as nodes. The distance of two clusters is modelled by a certain weight on the edge linking the nodes.


The text analytics we performed in the methods section produced eleven different clusters. These clusters are reflected in the most frequent key terms used by the YouTube video commenters. We used an inductive approach to interpret each cluster’s theme. This is accomplished by individually reviewing sentences extracted from YouTube video comments. As suggested by Glowacki et al. (2017), we compared our assessments, and disagreements were resolved by group discussions to arrive at this final description. We then explained each cluster with excerpts of YouTube comments. We interpreted data clusters with a narrative approach to understand meaningful human experiences (see Supplementary material 2). We labelled each cluster based on the most frequently used terms and their weights calculated by SAS Text Miner. Table 4 shows the summary of cluster interpretation in descending order of cluster frequency indicated in the data analysis.

Table 4 Summary of text cluster interpretation.

Traditionally, identifying and grouping themes requires several rounds of coding, adding, and consolidating codes in order to reach a point of adequate saturation. This approach is based on code appearances across the corpora. Consistent with calculating term frequency but distinctively different from the traditional coding method, this study adopted text clusters derived from a predetermined algorithm embedded into SAS Text Miner, which was deemed accurate and reduced bias compared to human coding. We then conducted supervised learning with these identified clusters by independently reading YouTube video comments and labelling these clusters. In line with the traditional survey item designs, we labelled clusters based on measurable items in statistical analysis, such as vaccination intention and perceived trust. We first read descriptive terms of each cluster and carried out an initial reading of about 20% of comments (n = 8000) independently among researchers. This step allows us to understand a narrative trope of our data regarding the COVID-19 vaccines. At a later stage, we carried out reading of the remaining 80% of comments (n = 35,203) independently. The only manually coding process in this study was labelling identified clusters. Based on the literature review and comments reading in the earlier stages, we managed to label each cluster by continuously corroborating the insights from the dataset. We also constantly triangulated the data with previous studies and the results of this study through virtual discussions due to the lockdown. One of the researchers conducted a quality check to ensure the assumptions and preconceptions of other researchers had not biased the labelling.

The criteria of labelling clusters are using the most frequently occurring descriptive terms produced from the text cluster algorithm. Regarding Cluster 20, the majority of YouTube video audience were not willing to get vaccinated. The likelihood of taking recommended preventive health action is an essential means of measuring health-related behaviours in the Health Belief Model (HBM) (Janz and Becker, 1984; Rosenstock et al., 1988). In this study, we thus labelled Cluster 20 as intention to get COVID-19 vaccines. Cluster 21 was identified as political ideologies. Previous study findings revealed that Republican (Red) states had significantly lower unadjusted HPV vaccine series initiation and completion rates (Suryadevara et al., 2019). A recent survey study has found that participants were willing to get vaccinated if they were moderate or liberal in their political leaning (Reiter et al., 2020). We included political ideologies as one important factor in influencing preventive health behaviours. A content analysis of anti-vaccine websites has shown a great deal of mistrust in multiple entities (e.g., government, pharmaceutical companies, and healthcare providers) expressed by online users. The values and beliefs of lying and irresponsible entities contributed to the distrust of vaccine advocate organisations (Moran et al., 2016). One recent survey study examined the impact of trust in pharmaceutical companies on COVID-19 vaccination acceptance (Wong et al., 2021). We thus included perceived trust in pharma as one crucial contextual factor of health-related behaviours. Cluster 10 was labelled as perceived trust in government. We included this cluster as another contextual factor of health-related behaviours. Some YouTube video viewers believe that the Mainstream Media (MSM) is lying and fear-mongering. As indicated by Moran et al.’s (2016) study findings, mistrust in mass media is one important contextual factor in predicting vaccination against vaccine-preventable diseases. We thus labelled Cluster 14 as perceived trust in media.

According to Rosenstock et al. (1998), perceived barriers refer to negative components of an anticipated behaviour, including financial costs, physical barriers, side-effects, and accessibility factors (Rosenstock et al., 1988). Perceived barriers, such as vaccine safety concerns and costs, were significant predictors of vaccine uptake (Dubé et al., 2013; Gerend and Shepherd, 2012). Cluster 18, labelled as perceived barriers, was used to measure the impact of vaccine safety concerns on vaccine hesitancy in this study. We identified Cluster 23 as vaccine misinformation. We observed that misinformation regarding COVID-19 vaccines was closely related to anti-vaccine conspiracy theories. In contrast to anti-vaccine groups, some YouTube users urged others to stop spreading misinformation and get vaccinated to protect people and save lives. Perceived severity refers to a concern about how threatening the condition is to the person. In the case of vaccines, previous studies have proved that perceived severity was strongly associated with HPV vaccine uptake (Gerend and Shepherd, 2012), and willingness to get vaccinated against COVID-19 (Reiter et al., 2020). We thus hypothesised that the perceived severity of COVID-19 infection would be an important factor in influencing hesitancy. Given the descriptive terms generated from unsupervised machine learning, we confirmed that the keyword effectiveness reflected people’s perceived benefit of the vaccine during the supervised learning. Thus, we labelled cluster 9 perceived benefit. Perceived benefit focuses on the belief regarding the effectiveness of a specific behaviour or alternative behaviour in preventing disease, maintaining health, and lessening undesirable consequences of the disease (Rosenstock et al., 1988). Perceived benefit, for instance, safety and effectiveness of a vaccine, accounted for the major variations in seeking vaccines. Gerend and Shepherd (2012) examined the relationships between HBM constructs (e.g., perceived benefit) and HPV vaccine acceptability (Gerend and Shepherd, 2012). Positive correlations of perceived benefit and COVID-19 vaccine acceptance were also found in a survey study (Wong et al., 2021). We adapted the perceived benefit to this study, intending to evaluate its impact on vaccination intention. We identified and labelled Cluster 13 as perceived susceptibility. It refers to a person’s view of the likelihood of experiencing a potentially harmful condition. Reiter et al. (2020) found that participants were likely to be willing to get vaccinated if they reported a higher level of perceived likelihood of getting a COVID-19 infection in the future. Perceived susceptibility was also measured in Sherman et al.’s (2020) study of public attitudes and beliefs in COVID-19 vaccination acceptability (Sherman et al., 2020). We adapted perceived susceptibility to examine its impact on COVID-19 vaccination intention.

Cluster 15 was labelled as vaccine misinformation. Based on our review of the COVID-19 vaccine hesitancy and social media, previous studies have found negative associations between misinformation and vaccine uptake intention. For instance, misinformation induced the decline in vaccine uptake (Loomba et al., 2021), inaccurate and negative social media data potentially influenced population-wide vaccine uptake (Basch et al., 2021a), belief in COVID-related conspiracy theories predicted resistance to preventive health behaviour and vaccination against COVID-19 (Romer and Jamieson, 2020). We intend to adapt vaccine misinformation as a contextual factor in our research framework. Moreover, we explore and measure HBM constructs and contextual factors (e.g., perceived trust in pharmaceutical companies, vaccine misinformation). We expect to observe vaccine hesitancy-induced behavioural change in the COVID-19 pandemic. Figure 3 is the proposed research model that depicts the relationships of constructs in the context of the COVID-19 vaccines. Table 5 shows the constructs of the research model.

Fig. 3: Research model.
figure 3

The proposed research model with several independent variables and one dependent variable.

Table 5 Constructs of the study model.

Reliability and validity of unsupervised and supervised learning

With the help of unsupervised machine learning, we conducted data collection and analysis, which was thoroughly and scientifically analysed. The process of understanding data refers to the initial exploration and evaluation of the collected data. Given the specific phenomenon (i.e., the COVID-19 vaccine efficacy) and study objectives, we decided to use YouTube comments generated from mainstream media outlet videos. As noted in the methods section, the cleaning data process needed to be done manually. We verified that the data collected was accurate, confirming that comments were related to the COVID-19 vaccine. This process of preparing data was thus found to be highly reliable. At the data analysis stage, we emphasised the choice and calibration of the methods to analyse data. We intended to validate the results compared to the objectives of the study. A hierarchical process was preferred in the text cluster algorithm (Fig. 2). We chose the combination of SVD and optimal k-means as it was judged to be a good way of balancing unstructured data at hand with the purpose of the study (Carnerud, 2017). The root mean square standard deviation (Table 3) showed the average distance between observations in clusters. It indicated the measure of goodness of fit of text cluster. At the stage of cluster interpretation, we consistently labelled each cluster on the basis of the frequency of occurrence of texts as labelling is an approved, preferred, and effective process of operating qualitative data (Carnerud, 2017). The supervised learning of cluster interpretation consisted of detailed notes and codes, accurate interpretation of comments, data transparency and authenticity, critical appraisal of cluster labelling. We tested validity through the convergence of information from various sources (e.g., news, previous studies). We are able to develop a comprehensive understanding of the social media ecology of the COVID-19 vaccine through the use of triangulation of qualitative research. This human judgement allowed us to validate, evaluate, and interpret generated clusters consistently and efficiently, leading to an achieved reliability and validity of this study.

Multiple regression analyses

We observe that several clusters are related to each other, making it interesting to explore them using an appropriate method. With derived clusters, we use a Cartesian coordinate system by displaying them in a scatterplot (see Fig. 4).

Fig. 4: Cartesian plot of cluster distances.
figure 4

The identified clusters were plotted according to the distance generated from a cluster algorithm.

The Cartesian coordinate measures the distance between clusters with X and Y coordinates representing space generated by the Singular Value Decomposition (SVD) algorithm. Clusters aligned together in the Cartesian system represent their similarities in semantic concepts. The values derived from text cluster (see Supplementary material 3) are the mean values. These values measure the distance of a data point to the centroid in each cluster. Given our study’s text cluster parameter setting, the iterative algorithm produced 46 ‘samples’. We believe that any changes of the distances of data points to one centroid in one cluster in the iterations would impact the distances to another centroid in another cluster. Using these values, we developed a predictive model where the independent variables such as perceived susceptibility, perceived barriers, perceived benefit, perceived trust in pharma, political ideologies, and vaccine misinformation were associated with vaccination intention. Unlike traditional survey items, there was single item/variable corresponding to each ‘sample’ for each cluster in this study (n = 46). Given the context of values generated from text clusters, we adopted observed variables representing each construct (Table 5). There are one dependent variable (i.e., vaccination intention) and ten independent variables in the proposed research model. We used multiple regression analysis to examine the impacts of the abovementioned variables on vaccination intention in this study. Multiple regression analysis is one of the most used statistical techniques to analyse the relationship between a single dependent variable and several independent variables. This regression analysis aims to evaluate the proposed research model by determining the influence of the independent variables on the single dependent variable. Given the fact that generated values were obtained from text clusters and proposed research model was hypothesised from the HBM literature and cluster labelling, multiple regression analysis is a clear viable solution to develop predictive modelling in relation to vaccination intention. The analyses were carried out using SPSS (22.0).

First, we ran the SPSS analysis to identify outliers. We followed the rule of thumb to take a boxplot approach. Supplementary material 4 is the boxplot of all clusters. It is obvious that case 1 is the outlier. After dropping the outlier, the R square value fell over 40%. The coefficients indicated that dropping this outlier did not significantly change the results or affect assumptions. Thus, we kept this outlier for further analysis.

Second, we performed linear regression analysis with all identified variables. We observed the variance inflation factor (VIF) of collinearity statistics. In a regular multiple regression, the path coefficients may be biased if the estimation involves significant levels of collinearity among the predictor constructs (Hair et al., 2010; Hair et al., 2014). Hair et al. (2014) considered VIF above 5.00 as indicative of collinearity that is too high. Hence, we removed one independent variable at a time based on the VIF values in descending order. For instance, vaccine misinformation1 was dropped, followed by perceived trust in government.

Third, after checking for collinearity, we continued to assess the nonlinearity and homogeneity of variance. The bivariate plot of the predicted value against residuals can infer whether the predictors’ relationships to the outcome are linear. The scatterplot (supplementary material 5) presents the standardised predicted value of independent variables on the dependent variable against the standardised residuals. From the Loess curve, it appears that the relationship of standardised predicted value to residuals is roughly linear around zero. It does not suggest a departure from linearity, and we would expect the Loess line to come close to the regression line. Moreover, the residuals seem to be randomly scattered around zero, we thus conclude that the linearity assumption is satisfied, and the heteroskedasticity assumption is satisfied if we run the fully specified predictive model (Cohen et al., 2003; Keith, 2019).

Fourth, we tested the normality of residuals using a normal probability plot (more specifically, a P–P plot). The P–P plot (supplementary material 6) compares the observed cumulative distribution function (CDF) of the standardised residual to the expected CDF of the normal distribution. In addition, the quantile–quantile (q–q) plot (supplementary material 7) compares the observed quantile with the theoretical quantile of a normal distribution. As can be seen from the Q–Q plot graph, the residuals conform fairly well to the diagonal straight line. We conclude that the residuals are normally distributed (Cohen et al., 2003).

Fifth, we performed multiple linear regression analyses to elucidate whether identified clusters/variables (e.g., political ideologies, perceived susceptibility, or perceived trust in pharma) or an interaction of these variables were predictors of vaccination intention among social media users. We focused on the total effects of variables (in particular, the prediction purpose) on an outcome (Keith, 2019). The results of multiple regression analyses (see Table 6) suggested two predictor variables were significantly related to vaccination intention. The R2 of 0.747 indicates that 74.7% of the variance can be accounted for by its linear relationship with the predictor variables. An R2 of 0.700 is described as close to substantial (Hair et al., 2010). It indicates that the regression model fits the observed data well. Adjusted R2 = 0.692, F (8, 37) = 13.628, p < 0.001. According to the standardised weight β in the following table, the regression equation is as follows:

$$\begin{array}{l}{\mathrm{Vaccination}}\,{\mathrm{intention}} = 0.379_{{\mathrm{susceptibility}}} + 0.371_{{\mathrm{ideologies}}} + 0.182_{{\mathrm{barriers}}}\\ + 0.174_{{\mathrm{media}}} + 0.137_{{\mathrm{pharma}}} + 0.060_{{\mathrm{misinformation}}2} - 0.119_{{\mathrm{severity}}} - 0.129_{{\mathrm{benefit}}}\end{array}$$
Table 6 Multiple regression analyses for the COVID-19 vaccination intention predictors.

The coefficients matrix provides information about the relative influence of independent variables. It showed that perceived susceptibility statistically influenced vaccination intention (β = 0.379, t = 2.981, p = 0.005). In other words, the perceived likelihood of getting infected with COVID-19 could influence their intention to get vaccinated. In addition, political ideologies is related to vaccination intention (β = 0.371, t = 2.663, p = 0.011). It indicates that social media users’ political ideologies could influence their intentions to get vaccinated against COVID-19. The results also showed that perceived trust in pharma, perceived trust in media, perceived barriers, perceived benefit, perceived severity, and vaccine misinformation2 were not significantly associated with vaccination intention. To this end, our cluster interpretation and predictive model generate meaningful results that provide us with better understandings of vaccine hesitancy factors and HBM constructs in influencing COVID-19 vaccination intention.


In previous sections, we interpreted and labelled eleven clusters generated from SAS Text analytics. Based on the thematic analysis, we adapted vaccine hesitancy factors and other contextual factors into the HBM constructs in an attempt to develop a research model regarding intention to get COVID-19 vaccines. We investigated a variety of casual relationships among the constructs by employing the multiple regression analysis method. Now we will discuss our major findings of COVID-19 vaccine hesitancy factors and public perceptions and behaviour changes. We will include theoretical and practical implications in this section as well.

Major findings

Using individual-level social media data of 43,203 YouTube comments, we reveal a number of important findings to policy-makers and stakeholders engaged in the public health sector. First, we find that the social media ecology of vaccine hesitancy is evolving. Polarized opinions and contrasting insights characterise the public discourses in this ecology. The profusion of true and false information provides us with a granular understanding of vaccine hesitancy in the social media context, especially amid the COVID-19 pandemic. Apart from vaccine opponents and proponents, a sizable proportion of YouTube users hesitated to get vaccinated against COVID-19. Consistent with previous studies (Basch et al., 2021a; Wong et al., 2021), their reasons included concerns about side-effects, effectiveness, and lack of trust in corporations and government. In addition, this study’s thematic analysis findings conveyed that public’s perceived trust in media influenced public’s vaccination intention. We thus argue that big data analytics of social media posts is a useful tool to aptly generate extensive research results for healthcare professionals to monitor and advance their strategies in improving COVID-19 vaccine uptake.

Second, this study identified reasons behind vaccine hesitancy toward COVID-19 vaccines among YouTube audience, including concerns about vaccine safety, effectiveness, lack of knowledge of COVID-19, and distrust of companies. We categorised labelled clusters into constructs: perceived susceptibility, perceived barriers, perceived benefit, perceived severity, perceived trust in pharmaceutical companies, perceived trust in government, perceived trust in media, political ideologies, vaccine misinformation, and vaccination intention. Perceived trust in companies, government, media, and political ideologies are newly emerged themes compared to previously reviewed articles (Basch et al., 2021a; Hussain et al., 2021). We also observed and identified several health belief model constructs. Similar to previous survey findings (Gerend and Shepherd, 2012; Wong et al., 2021), perceived susceptibility, perceived barriers, and perceived severity played important roles in the public’s decision-making process. Prior studies have measured the impact of vaccine misinformation on the population’s vaccine uptake (Loomba et al., 2021; Romer and Jamieson, 2020). Similar to these findings, an additional theme, vaccine misinformation, was also presented in this text analytics study.

Third, based on the multiple regression analyses results, we found the relative importance of health belief model constructs and vaccine hesitancy contextual factors. The top significant factors influencing vaccination intention included perceived susceptibility and political ideologies. This ranking of factors allows healthcare professionals to facilitate target vaccination campaigns. The results showed that perceived susceptibility and political ideologies were strongly associated with COVID-19 vaccination intention. In other words, we can use HBM constructs to explain and estimate COVID-19 vaccine uptake behaviours. In particular, perceived susceptibility could influence social media users’ willingness to get vaccinated. The result of perceived susceptibility influence is aligned with previous studies (Gerend and Shepherd, 2012; Romer and Jamieson, 2020) but contrasts with Wong et al.’s (2021) study. The explanation stated that the COVID-19 was considered a mild disease (Wong et al., 2021). Furthermore, our current analysis suggested that social media users have a partisan preference with regard to vaccine information promulgated on YouTube despite the quality of sources. These users tend to choose sources that reflect a similar ideology to their personal beliefs (Suryadevara et al., 2019). One survey study found no evidence that political ideologies were directly related to vaccination intention (Romer and Jamieson, 2020). However, as evidenced by the regression results, YouTube audience whose political leaning toward conservatives are less likely to get vaccinated. Moreover, the right-wing has more tendencies to reduce physical distancing (Gollwitzer et al., 2020), promote vaccine misinformation, and these echo chambers amplify anti-vaccine beliefs in social media platforms (Larson, 2020; Thelwall et al., 2021). Additionally, social media users identified as political conservatives expressed less degree of trust in the government and experts. The vaccine issue is deeply ingrained in politics. Verger and Dubé (2020) noted that vaccine distrust was enmeshed in social and political protest and criticism of vaccines became the hobby horse of opposition parties (Verger and Dubé, 2020). This is particularly troubling because the government and its agencies are the key disseminators of vaccine information. Hence, maintaining and restoring public trust is a necessary component of successful health communication strategies (Moran et al., 2016). Similar to previous studies’ findings (Loomba et al., 2021; Romer and Jamieson, 2020), our results cannot draw causal links between perceived trust in media and vaccination intention. However, the data interpretation indicated that mainstream media YouTube channels may be affecting the public responses to vaccine uptake.

Theoretical and practical implications

This study makes important contributions to the research methods and theory. Our method of applying text analytics via frequencies and weights of words using social media data is an important contribution. With user-generated contents (texts) transformed into a set of numbers, we are able to use data mining algorithms to generate insights for predictive modelling (Chakraborty et al., 2013). We believe this is a unique combination of big data text analytics and predictive modelling in the research of vaccine hesitancy. Another key contribution of this study is to empirically demonstrate the relevance of HBM constructs and vaccine hesitancy factors in the context of COVID-19 vaccines. Chiming with theoretical constructs, perceived susceptibility, perceived barriers, perceived severity, and perceived benefit were tested in previous survey studies but have not been tested on a large scale using real-time data. The evidence of perceived susceptibility and political ideologies having effects on vaccination intention validated these factors in the pandemic. The examination of vaccine misinformation as part of vaccine hesitancy factors is an extension of the HBM constructs. Furthermore, this study provides an extensive view of the social media ecology of COVID-19 vaccines. With individual-level vaccine perceptions illuminating pathways, we believe our research contributes to our understandings of existing beliefs that influence how people receive information, not only political-related information but also healthcare-related information.

This study has implications for practice. We provide necessary knowledge about vaccine hesitancy factors that played roles in the public’s decision-making of COVID-19 vaccine uptake, including political ideologies. As evidenced by previous studies in France (The COCONEL Group, 2020) and Europe (Kennedy, 2019), positive connections were found between political beliefs and attitudes toward vaccines. In the case of COVID-19 vaccines, political parties and politicians were found to suppress science, denigrate health experts, and peddle disinformation (Abbasi, 2020; Fidler, 2020). For instance, former US president Trump manipulated the Food and Drug Administration (FDA) to approve drugs hastily (Abbasi, 2020). Some expressed concerns that political appointees may insist on the Emergency Use Authorisation vaccine over the recommendation of FDA scientists (Bauchner et al., 2020). Moreover, the UK’s pandemic response relies heavily on scientists and other government appointees with worrying competing interests (Abbasi, 2020). Such interferences from governments would present a risk to the public and cause further erosion of public trust in the government’s ability to make critical scientific decisions (Bauchner et al., 2020). It leads to dwindling vaccine uptakes in the long run. The best approach for government’s pandemic responses in exceptional times is the independence of science agencies and politicians kept from the pronouncements of vaccine-related issues, which can create the appearance of interferences (Lurie et al., 2020). Abbasi (2020) has proposed two steps to safeguard science from politicisation: full disclosure of competing interests from government, politicians, scientific advisors, and appointees; and full transparency of decision-making systems and processes. Politicians have a higher responsibility to the public (Abbasi, 2020). Hence, a root-and-branch reconstruction of political interests on infectious diseases is required in developing and rebuilding health policies sufficiently (Fidler, 2020).

Given a strengthening anti-vaccine movement and continued politicisation of vaccines, public health organisations can utilise this study’s methods and findings of vaccine hesitancy factors to facilitate target intervention strategies of combating vaccine hesitancy and vaccine misinformation. First, practitioners are encouraged to take a creative approach to communicate with the public on social media platforms instead of traditional framed messages, authoritative and fact-filled, in promoting vaccines. For instance, online health communication should provide harm and benefit information. Personal narratives such as storytelling style are found to harbour public interests and reinforce public trust from experts or relatable peers (Thomas and Pollard, 2020). A key action for public health professionals is to provide statistical strength of evidence in an anecdotal and palatable way in social media accounts. It is critical to construct a trustworthy information environment that engaging in evidence-based health communication strategies (Thomas and Pollard, 2020). Moreover, practitioners are advised to take vaccination promotions out of medical settings and outreach to social media platforms where the public obtain and share health-related information. They should be cognizant of the negative effects of anti-vaccine on undecided population segments in vaccine uptake. Additional efforts are needed to keep track of vaccine sentiment and take pre-emptive measures to mitigate any drawbacks. Second, rather than rebutting conspiracy theories, Ahmed (2021) believed practitioners should inoculate against this vaccine misinformation and join local communities in the Facebook groups to offer answers to community members about COVID-19 vaccines (Ahmed, 2021). Meanwhile, online users should resist falling into the trap of engaging in vaccine misinformation and learn to share good information from trusted sources. To effectively address misinformation and public concerns, Thomas and Pollard (2020) recommend public health organisations and practitioners adapt to this digital era and utilise social media platforms to point out logical flaws or malicious intent of anti-vaccine arguments. This online presence of immunisation dialogue is useful in dispelling misconceptions, addressing concerns, and improving vaccination coverage. Lastly, social media companies in 2019 have pledged to act against the anti-vaccine movement, such as Facebook announced that it would not recommend misinformation content, and YouTube removed advertisements from anti-vaccine videos. Ahmed (2021) offers a simple and sterner solution to combat anti-vaccine misinformation, to remove anti-vaccine misinformation super spreaders. Another study recommends social media firms provide visible rewards for users who generate reliable materials, such as adding a ‘trust button’ and displaying the number of ‘trust clicks’ as posts received (Sharot, 2021). To this end, we believe that strategic intervention initiatives by public health organisations and social media companies can counteract the spread of COVID-19 vaccine misinformation.

Limitations and future studies

This study is delimited in scope in the following ways. First, the data collected was limited to the period of announcements of COVID-19 candidate vaccine efficacy in November 2020. This period coincided with the 2020 US election. We thus expected that a kerfuffle over political views and controversial political figures. In addition, future studies can be conducted to observe longitudinal behavioural changes with different pivotal points in the vaccine development process and vaccination programs. Second, the sampling bias may exist due to the selected social media data. Future studies are encouraged to collect a greater amount of social media data from multiple platforms in an attempt to reduce method bias. Meanwhile, we believe understanding current social media users’ attitudes would play a critical role in predicting future generations’ vaccination intention. Third, we did not have the data on YouTube users’ demographic information while preserving users’ anonymity during the data collection. We thus cannot analyse the correlates of demographic characteristics (e.g., age, gender, and ethnicity) to their vaccine-hesitant behaviours. Fourth, as our approach of using text analytics in predictive modelling in vaccination intentions is relatively novel, future studies can be conducted to develop survey scales for vaccine hesitancy factors in an attempt to assess emerging contextual factors and validate the proposed predictive modelling. With these survey scales, we encourage researchers to collect a larger survey sample size to reduce factor dimensions, investigate effect size (f2), and explore reasons for unsupported statistical relationships in the current study. We applied a parsing and stemming process using the tokenization based on the SAS dictionary-based stemmer. To reduce probabilities of cluster overlapping, it is advisable for researchers to train and analyse textual data using an n-gram algorithm, where the computation of linguistics involving two or more terms is possible. In particular, this approach will require a more complex algorithm and a large amount of computational power and resources. Lastly, cues to action construct was not included in identified themes because our independent review of all data found no evidence of this often neglected HBM construct. This factor refers to the signals and reminders from doctors and nurses to individuals to engage in health-related behaviour. We suggest obtaining relevant data in future studies to examine the effects of cues to action on influencing public vaccine hesitancy and vaccine uptake via personal health beliefs.


In summary, this study identified the YouTube audience’s perceptions of COVID-19 vaccines. Our data interpretation provides major reasons why YouTube audience may feel hesitant or refuse to receive a COVID-19 vaccine. These reasons fell under the following themes: perceived susceptibility, perceived severity, perceived benefit, perceived barriers, perceived trust in pharmaceutical companies, government, media, political ideologies, and vaccine misinformation. We offer these findings hoping that public health practitioners can implement more effective public health communication strategies and tailored interventions in the social media platforms. A better understanding of various social, political, and psychological factors discussed in the study can be used to maximise the positive effects of public health communications. The engagement and participation of key stakeholders are required to convince a sufficient proportion of individuals to get vaccinated against COVID-19, thereby achieving the threshold necessary for herd immunity.