Atherosclerotic cardiovascular disease (ASCVD) remains the leading cause of death in the United States1. Earlier identification of ASCVD and earlier intervention are critical for reducing its morbidity and mortality, as over a third of all ASCVD deaths occur in individuals with no prior symptoms1. Detection of coronary artery calcification (CAC) by a specialized computed tomography (CT) scan (“CAC scan”) can help guide patients and clinicians in shared decision-making around cardiovascular risk assessment2. As such, CAC scans are endorsed by multiple medical societies as powerful tools for personalizing cardiovascular risk and preventive therapy recommendations3,4. CAC may also be a strong motivator for improving health behaviors, including lifestyle changes and adherence to preventive therapies like statins5,6.

While public interest in CAC has grown over time, current public perceptions about CAC are not well described7. Understanding these beliefs about CAC is critical, as it may help frame shared decision-making discussions and guide public health interventions around ASCVD. Artificial intelligence (AI)-enabled analysis of large volumes of social media data can provide an efficient approach for analyzing contemporary public opinions on common health-related topics and allow for a systematic evaluation of emerging themes8. Reddit is a free and widely used social media platform with over 52 million daily active users and over 30 billion views every month9. In this study, we leverage an AI pipeline using natural language processing and unsupervised learning to characterize real-world perceptions about CAC using discussions on Reddit.

We extracted a total of 5606 unique CAC-related discussions (1017 posts, 4589 comments) from 3545 unique users across 990 subreddits from March 29, 2008, through May 21, 2023 (Supplementary Fig. 1). The largest number of discussions from a single author was 26, while 3463 (97.7%) authors contributed fewer than six discussions each. The subreddits with the most discussions were r/keto (7.5% of all discussions), r/Cholesterol (7.0%), and r/AskDocs (5.8%). The number of CAC-related discussions increased by an average of 57.2% yearly. Using a pretrained, sentence-level Bidirectional Encoder Representations from Transformers (BERT) model, we embedded these discussions into a vectorized language space, in which they were further dimensionally reduced and clustered to identify a total of 91 topics (Fig. 1). These topics were further clustered to identify 14 overarching groups. The largest topics and groups centered around CAC testing to evaluate symptoms (e.g., palpitations, chest pain, and anxiety) and de-risking non-ischemic cardiovascular disease (groups 1, 5); interpreting CAC scores in the context of lifestyle and lipid results (groups 2, 4, 8, and 10); and the disadvantages of CAC testing (e.g., financial cost, radiation risk) (Table 1). Other notable topics included indications for CAC testing (e.g., topics 10, 27, 42, 48), CAC and statins (e.g., topics 24, 31, 34), how ketogenic diets can affect CAC (e.g., topics 16, 19), radiation exposure risk (e.g., topics 22, 45), insurance issues (e.g., topics 29, 30, 37, 50), and celebrities with CAC (e.g., topics 43, 55, 56). A separate pretrained BERT model was used to analyze the sentiment of each discussion, showing that 49.5% of discussions were neutral, 48.4% were negative, and 2.1% were positive. The average sentiment of all discussions remained stably neutral-to-negative (−0.42 to −0.50) each year from 2013 through 2023 (Supplementary Fig. 2).

Fig. 1: Topic modeling revealed 91 topics and 14 groups.

Visual representation of the topics (circles) and groups (color); the size of each topic (circle) represents the relative number of discussions grouped in that topic. The x- and y-axes represent the two Uniform Manifold Approximation and Projection features onto which topics were dimensionally reduced to allow their visualization. CAC coronary artery calcium, CCTA computed tomography (CT) coronary angiography.

Table 1 Overview of Groups of Topics With Example Text

Our AI-enabled analysis of public perceptions of CAC testing demonstrates how well our previously described algorithm for topic modeling generalizes to another clinical domain8. A powerful aspect of our pipeline is leveraging techniques in unsupervised machine learning that obviate the need for topic prespecification, which allows discovery of previously unexpected ideas (e.g., non-evidence-based use of CAC). Such topic modeling analyses can also provide clinical insights that may be further explored to test generated hypotheses. By harnessing the power of AI on pre-existing datasets, we demonstrate a fast, inexpensive method of gathering public opinions that would otherwise require time- and finance-intensive clinical registries and user surveys to collect. Through this efficient extraction and interpretation of large volumes of social media data, AI also offers the ability to continuously evaluate public sentiment over time, monitor for emerging topics, and stream clinical insights to key stakeholders that could impact clinical care.

Our study revealed several noteworthy insights about public perceptions around CAC testing. First, CAC testing had a strong impact on therapeutic decision-making. Many discussions emphasized the power of a CAC score of zero as a way of de-risking individuals and avoiding statin therapy. While a CAC-based de-escalation strategy is supported by practice guidelines, the presence of other risk-enhancing lifestyle or clinical factors (e.g., diabetes) may affect these decisions10. Conversely, many discussions in which a non-zero CAC was noted demonstrated how these findings helped motivate lifestyle changes. Ultimately, CAC interpretation is nuanced, and our study highlights that public discussions around interpretation of CAC results may not always be guideline-concordant, underscoring the need for shared decision-making between patients and clinicians.

Second, there were several discussions surrounding non-evidence-based uses of CAC testing, including for evaluation of patients with cardiac symptoms, such as chest pain and palpitations. This may be discordant with current clinical guidelines, which endorse the use of CAC testing in primary prevention among asymptomatic patients, particularly those with intermediate ASCVD risk3. Many discussions also misattributed the negative predictive value of a CAC scan to evaluate non-specific symptoms typically not related to ASCVD risk assessment, which may further misrepresent the current indications for CAC to the public. Future work may focus on evaluating the dynamics of how such misinformation can be amplified in social media frameworks and ultimately help determine optimal strategies for containing its spread.

Third, we identified discussions regarding the disadvantages of CAC testing, including out-of-pocket costs due to lack of insurance coverage and radiation exposure. However, many individuals still found value in CAC testing despite costs and radiation. The cost-effectiveness of CAC has been reported elsewhere in the literature11. Although the radiation risk associated with CAC testing is minimal, similar to ambient radiation from living in large cities12, our work identified that patients may be concerned about this risk when deciding to pursue CAC testing.

Finally, we found that the sentiment around CAC-related discussions was mostly neutral-to-negative. This is consistent with prior studies evaluating healthcare discussions on Reddit, which identify a negative tone and expressions of sadness, fear, and anger that are believed to reflect the underlying patient experience in a complex healthcare environment13. This negativity bias is well reported in the media and can impact health outcomes14, suggesting the importance of public health efforts to moderate misinformation15.

This study should be interpreted in the context of its limitations. Discussions in this study reflect views of Reddit users, who have historically been younger and may not be broadly representative of patients at high risk of ASCVD16; however, CAC testing is most appropriate for lower and intermediate risk individuals. While a variety of search terms were used, this dataset may not capture all CAC-related discussions on Reddit if individuals use other terms to refer to CAC. The clustering techniques we employed may determine similarity based on linguistic concordance rather than clinical concordance, which may lead to seemingly redundant topics and groups. This limitation highlights how AI can augment, but not replace, researchers in analyzing large datasets, and opens the door to consider how more advanced NLP techniques, like large language models, can improve this pipeline.

In this AI-enabled qualitative study of discussions on Reddit, we identified contemporary public perceptions and sentiments around CAC, which included the impact of CAC on therapeutic decision-making, non-evidence-based use of CAC testing, and the perceived downsides of CAC testing. The themes uncovered from this study highlight potential areas of patient concern and misinformation that can be addressed to improve shared decision-making around ASCVD management, improve statin adherence rates, and reduce ASCVD morbidity and mortality.

Methods

Dataset

Reddit (www.reddit.com) was used as the data source for this study17. It is composed of communities called “subreddits”, which are prefixed by “r/” and are focused on specific topics (e.g., r/AskDoctors, r/WorldNews, r/Keto). Users may interact with the platform by creating a “post” to initiate a new discussion thread and by commenting on other users’ posts as part of discussions (“comments”). Most subreddits, including all posts and comments contained within them, are openly accessible and visible without having to create a Reddit user account.

To create a list of CAC-related discussions from Reddit, an Application Programming Interface (API) called PushShift was used to search all the posts and comments on Reddit for case-insensitive matching on the following commonly used terms for CAC scans: “coronary artery calcium”, “coronary calcium”, “cac score”, “calcium score”, and “heart scan”7,18.
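
The term matching described above can be sketched in a few lines of Python. This is an illustration only, not the authors' code: it does not call PushShift, it assumes plain substring matching (the text does not say whether matching was substring- or token-based), and the record fields `id` and `body` as well as the function names are hypothetical.

```python
import re

# Search terms quoted in the text above.
CAC_TERMS = [
    "coronary artery calcium",
    "coronary calcium",
    "cac score",
    "calcium score",
    "heart scan",
]

# One compiled pattern; re.IGNORECASE gives the case-insensitive matching.
# Assumption: substring matching, so "cac score" also matches "CAC scores".
CAC_PATTERN = re.compile(
    "|".join(re.escape(term) for term in CAC_TERMS), re.IGNORECASE
)


def is_cac_related(text):
    """Return True if the text contains any of the CAC search terms."""
    return CAC_PATTERN.search(text) is not None


def filter_unique(discussions):
    """Keep CAC-related records and drop repeated ids.

    Each record is a dict with hypothetical keys 'id' and 'body'.
    """
    seen = set()
    kept = []
    for record in discussions:
        if record["id"] in seen or not is_cac_related(record["body"]):
            continue
        seen.add(record["id"])
        kept.append(record)
    return kept
```

For example, `is_cac_related("My CAC Score came back zero")` is true, while a comment that only mentions calcium supplements is not matched, which illustrates why discussions using other wording for CAC could be missed.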

This study was deemed exempt from ethical review since it did not involve human subjects as defined in 45 United States’ Code of Federal Regulations (CFR) 46.102(f) or 21 CFR 50.3(g).

Data analysis

Details around topic modeling and sentiment analysis in this paper are described elsewhere8. Briefly, after preprocessing, discussions were embedded into a numerical representation using a pretrained, sentence-level Bidirectional Encoder Representations from Transformers (BERT) model called all-MiniLM-L6-v2 (ref. 19), which has been trained on over 600 million Reddit posts and a dataset containing over 12 million papers from medical journals. This embedding was then reduced to a lower-dimensional representation using the Uniform Manifold Approximation and Projection algorithm to improve the performance of clustering into topics, which was done with Spectral Clustering. Since topics may be similar in content but be differentiated by other embedded features from the model (e.g., linguistic style, tone), a subsequent clustering analysis was performed to find overarching themes of discussion (“groups”). The numbers of topics and groups were automatically determined based on optimizing the Silhouette Coefficient and Davies-Bouldin Index, which are mathematical measures of how similar discussions are within a cluster relative to how similar those discussions are to those in other clusters. A separate BERT model, RoBERTa, pretrained on social media posts, was used to classify sentiment (i.e., “positive”, “neutral”, or “negative” classification of text).
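
To make the two cluster-validity measures concrete, the following Python sketch computes both on a toy set of two-dimensional points. It is not the authors' implementation: it uses Euclidean distance and the standard textbook definitions, the function names and toy data are hypothetical, and the text does not specify how the two measures were combined when choosing the number of clusters. A higher Silhouette Coefficient and a lower Davies-Bouldin Index both indicate tighter, better-separated clusters.

```python
import math


def group_by_label(points, labels):
    """Collect points into a dict mapping cluster label -> list of points."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def silhouette_coefficient(points, labels):
    """Mean silhouette over all points (range -1 to 1, higher is better).

    Requires at least two clusters.
    """
    clusters = group_by_label(points, labels)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def davies_bouldin_index(points, labels):
    """Davies-Bouldin Index (0 or more, lower is better).

    Requires at least two clusters with distinct centroids.
    """
    clusters = group_by_label(points, labels)
    dims = len(points[0])
    centroids = {
        key: tuple(sum(p[i] for p in members) / len(members) for i in range(dims))
        for key, members in clusters.items()
    }
    # Scatter: mean distance of a cluster's members to its centroid.
    scatter = {
        key: sum(math.dist(p, centroids[key]) for p in members) / len(members)
        for key, members in clusters.items()
    }
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst_ratios) / len(worst_ratios)


# Toy data: two well-separated blobs, with one sensible and one poor labeling.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]
bad_labels = [0, 1, 0, 1, 0, 1]
```

On this toy data the sensible labeling scores a higher Silhouette Coefficient and a lower Davies-Bouldin Index than the poor one. Choosing the number of clusters then amounts to computing these measures for each candidate clustering and keeping the one that scores best.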

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.