Introduction

In recent years, there has been a paradigm shift in recognizing the need to structure health services, so they are organized around the individuals seeking care, rather than on the disease1,2. Integrated person-centered health services require that individuals be empowered to take charge of their own health, and have the education and support they need to make informed health decisions3. Over the last decade, the internet and the use of health apps have emerged as critical spaces where individuals are accessing health information. In the U.S., over 70% of the population uses the Internet as a source of health information4. A 2017 study in Sub-Saharan Africa, reported that 41% of Internet users used the Internet for health information and medicine5, highlighting the value of the internet as a global health information source. Concurrently, there has also been a proliferation of health-related apps, with an estimated 318,000 health apps available globally in 20176. To facilitate two-way health communication and center care around the individual user, apps are integrating conversational agents or “healthbots” within the app7,8.

Healthbots are computer programs that mimic conversation with users using text or spoken language9. The advent of such technology has created a novel way to improve person-centered healthcare. The underlying technology that supports such healthbots may include a set of rule-based algorithms, or employ machine learning techniques such as natural language processing (NLP) to automate some portions of the conversation. Healthbots are being used with varying functions across a range of healthcare domains; patient-facing healthbots9 are largely focused on increasing health literacy10, mental health (i.e., depression, anxiety)11,12,13,14,15,16, maternal health17,18,19, sexual health and substance use20, nutrition and physical activity21,22,23, among others.

The use of healthbots in healthcare can potentially fill a gap in both the access to and quality of services and health information. First, healthbots are one way in which misinformation could be managed within the online space by integrating evidence-based informational bots within existing social media platforms and user groups. In a 2018 American Association of Family Physicians survey of family practitioners, over 97% of providers noted discussions with patients regarding inaccurate or incorrect health information from the Internet24. Second, healthbots can also be used to triage patients presenting with certain symptoms, as well as to provide additional counseling support after a clinical encounter, thereby reducing the burden on the healthcare system, while also prioritizing patient experience. This is increasingly important given the global shortage of healthcare workers, estimated at a deficit of around 18 million by 203025. Healthbots can be an adjuvant to clinical consultations, serving as an informational resource beyond the limited amount of time for doctors-patient interactions26. A few recent studies suggest engagement with healthbots results in improvements in symptoms of depression and anxiety11,16,27, preconception risk among African American women17, and literacy in sexual health and substance abuse prevention among adolescents20. In addition to providing evidence-based health information and counseling, healthbots may aid in supporting patients and automating organizational tasks, such as scheduling appointments, locating health clinics, and providing medication information28. While more robust evaluations of the impact of healthbots on healthcare outcomes are needed, preliminary results suggest this is a feasible approach to engage individuals in their healthcare. Often, interventions that use healthbots are targeted at patients/clients, without the active engagement of a healthcare provider. As such, healthbots may also pose certain risks. Especially in cases where such interventions employ machine learning approaches, it is important to understand and monitor the measures that are taken by developers to ensure patient/client safety. Currently, there is a lack of a clear regulatory framework for such health interventions, which may pose a range of risks to users including threats to their privacy and security of healthcare information27,29.

While healthbots have a potential role in the future of healthcare, our understanding of how they should be developed for different settings and applied in practice is limited. A few systematic and scoping reviews on health-related chatbots exist. These have primarily focused on chatbots evaluated in peer-reviewed literature9,27,30,31,32,33, provided frameworks to characterize healthbots and their use of ML and NLP techniques9,32,33, are specific to health domains (e.g., mental health;27,30,31 dementia34), for behavior change35, or around the design and architecture of healthbots9,32,36. There has been one systematic review of commercially available apps; this review focused on features and content of healthbots that supported dementia patients and their caregivers34. To our knowledge, no review has been published examining the landscape of commercially available and consumer-facing healthbots across all health domains and characterized the NLP system design of such apps. This review aims to classify the types of healthbots available on the app store (Apple iOS and Google Play app stores), their contexts of use, as well as their NLP capabilities.

To facilitate this assessment, we develop and present an evaluative framework that classifies the key characteristics of healthbots. Concerns over the unknown and unintelligible “black boxes” of ML have limited the adoption of NLP-driven chatbot interventions by the medical community, despite the potential they have in increasing and improving access to healthcare. Further, it is unclear how the performance of NLP-driven chatbots should be assessed. The framework proposed as well as the insights gleaned from the review of commercially available healthbot apps will facilitate a greater understanding of how such apps should be evaluated.

Methods

Search strategy

We conducted iOS and Google Play application store searches in June and July 2020 using the 42Matters software. 42Matters is a proprietary software database that collects app intelligence and mobile audience data, tracking several thousand metrics for over 10 million apps37; it has been used previously to support the identification of apps for app reviews and assessments38,39. A team of two researchers (PP, JR) used the relevant search terms in the “Title” and “Description” categories of the apps. The language was restricted to “English” for the iOS store and “English” and “English (UK)” for the Google Play store. The search was further limited using the Interactive Advertising Bureau (IAB) categories “Medical Health” and “Healthy Living”. The IAB develops industry standards to support categorization in the digital advertising industry; 42Matters labeled apps using these standards40. Relevant apps on the iOS Apple store were identified; then, the Google Play store was searched with the exclusion of any apps that were also available on iOS, to eliminate duplicates.

Search terms were identified leveraging Laranjo et al. and Montenegro et al., following a review of relevant literature9,32. The search terms initially tested in 42Matters for both app stores were: “AI”, “assistance technology”, “bot”, “CBT”, “chat”, “chatbot”, “chats”, “companion”, “conversational system”, “dialog”, “dialog system”, “dialogue”, “dialogue system”, “friend”, “helper”, “quick chat”, “therapist”, “therapy”, “agent”, and “virtual assistant.” 42Matters searches were initially run with these terms, and the number of hits was recorded. The first 20 apps for each term were assessed to see if they are likely to be eligible; if not, the search term was dropped. For instance, searching “helper” produced results listing applications that provided assistance for non-healthcare related tasks, including online banking, studying, and moving residence; therefore “helper” was excluded from the search. This process resulted in the selection of nine final search terms to be used in both the iOS Apple and Google Play store: “agent”, “AI”, “bot”, “CBT”, “chatbot”, “conversational system”, “dialog system”, “dialogue system”, and “virtual assistant”. Data for the apps that were produced using these search terms were downloaded from 42Matters on July 16th, 2020.

Eligibility criteria and screening

The study focused on health-related apps that had an embedded text-based conversational agent and were available for free public download through the Google Play or Apple iOS store, and available in English. A healthbot was defined as a health-related conversational agent that facilitated a bidirectional (two-way) conversation. Applications that only sent in-app text reminders and did not receive any text input from the user were excluded. Apps were also excluded if they were specific to an event (i.e., apps for conferences or marches).

Screening of the apps produced by the above search terms was done by two independent researchers (PP, JR) on the 42Matters interface. To achieve consensus on the inclusion, 10% of the apps (n = 229) were initially screened by two reviewers. The initial screening included a review title and descriptions of all apps returned by the search. This process yielded a 91% agreement between the two reviewers. Disagreements were discussed with the full research team, which further refined the inclusion criteria. Based on this understanding, the remaining apps were screened for inclusion.

All apps that cleared initial screening, were then downloaded on Android or iOS devices for secondary screening. This included as assessment of whether the app was accessible and had a chatbot function. For apps that cleared secondary screening, the following information was downloaded from 42Matters: app title, package name, language, number of downloads (Google Play only), average rating, number of ratings, availability of in-app purchases, date of last update, release date (Apple iOS only), use of IBM Watson software, Google sign-in (Apple iOS only), and Facebook sign-in (Apple iOS only). Utilization of Android permissions for Bluetooth, body sensors, phone calls, camera, accounts access, and internet as well as country-level downloads was also extracted from Google Play store only.

Data synthesis

For data synthesis, an evaluation framework was developed, leveraging Laranjo et al., Montenegro et al., Chen et al., and Kocaballi et al.9,32,41,42. Two sets of criteria were defined: one aimed to characterize the chatbot, and the second addressed relevant NLP features. Classification of these characteristics is presented in Boxes 1 and 2. We calculated the percentage of healthbots that had each element of the framework to describe the prevalence of various features in different contexts and healthcare uses. Determination of the NLP components of the app was made based on the research team using the app and communicating with the chatbot. If certain features were unclear, they were discussed with the research team, which includes a conversational agent and natural language processing expert. To understand the geographic distribution of the apps, data were abstracted regarding the highest percentage of downloads for each app per country, for the top five ranking countries. Percentages were taken of the total number of downloads per app per country, which was only publicly available for the Google Play store (n = 22), to depict the geographic distribution of chatbot prevalence and downloads globally.

This study protocol was not registered.

Results

The search initially yielded 2293 apps from both the Apple iOS and Google Play stores (see Fig. 1). After the initial screening, 2064 apps were excluded, including duplicates. The remaining 229 apps were downloaded and evaluated. In the second round of screening, 48 apps were removed as they lacked a chatbot feature and 103 apps were also excluded, as they were not available for full download, required a medical records number or institutional login, or required payment to use. This resulted in 78 apps that were included for review (See Appendix 1).

Fig. 1: Consort Diagram.
figure 1

This consort diagram shows the app selection process.

Twenty of these apps (25.6%) had faulty elements such as providing irrelevant responses, frozen chats, and messages, or broken/unintelligible English. Three of the apps were not fully assessed because their healthbots were non-functional.

App characteristics and core features

The apps targeted a range of health-related goals. Forty-seven (42%) of the apps supported users in performing health-related daily tasks including booking an appointment with a healthcare provider and guided meditation, twenty-six (23%) provided information related to a specific health area including some which provided counselling support for mental health, twenty-two (19%) assessed symptoms and their severity, sixteen (14%) provided a list of possible diagnoses based on user responses to symptom assessment questions, and two (2%) tracked health parameters over a period of time. Table 1 presents an overview of other characteristics and features of included apps.

Table 1 Characterization of the Chatbot—Health Contexts and Core Features.

Most apps were targeted at patients. Seventy-four (53%) apps targeted patients with specific illnesses or diseases, sixty (43%) targeted patients’ caregivers or healthy individuals, and six (4%) targeted healthcare providers. The total sample size exceeded seventy-eight as some apps had multiple target populations.

The apps targeted one or more health domain areas. There were 47 (31%) apps that were developed for a primary care domain area and 22 (14%) for a mental health domain. Involvement in the primary care domain was defined as healthbots containing symptom assessment, primary prevention, and other health-promoting measures. Additionally, focus areas including anesthesiology, cancer, cardiology, dermatology, endocrinology, genetics, medical claims, neurology, nutrition, pathology, and sexual health were assessed. As apps could fall within one or both of the major domains and/or be included in multiple focus areas, each individual domain and focus area was assigned a numerical value. While there were 78 apps in the review, accounting for the multiple categorizations, this multi-select characterization yielded a total of 83 (55%) counts for one or more of the focus areas.

There were only six (8%) apps that utilized a theoretical or therapeutic framework underpinning their approach, including Cognitive Behavioral Therapy (CBT)43, Dialectic Behavioral Therapy (DBT)44, and Stages of Change/Transtheoretical Model45. Five of these six apps were focused on mental health.

Personalization was defined based on whether the healthbot app as a whole has tailored its content, interface, and functionality to users, including individual user-based or user category-based accommodations. Furthermore, methods of data collection for content personalization were evaluated41. Personalization features were only identified in 47 apps (60%), of which all required information drawn from users’ active participation. All of the 47 personalized apps employed individuated personalization. Forty-three of these (90%) apps personalized the content, and five (10%) personalized the user interface of the app. Examples of individuated content include the healthbot asking for the user’s name and addressing them by their name; or the healthbot asking for the user’s health condition and providing information pertinent to their health status. In addition to the content, some apps allowed for customization of the user interface by allowing the user to pick their preferred background color and image.

We also documented any additional engagement features the app contained. The most frequently included additional feature was the use of push notifications (20%) to remind the user to utilize the app. Table 1 lists the other additional features along with their frequencies. All of the apps were available as installable software via mobile devices and tablets and eleven of them (11%) were also accessible via a web browser on laptops, desktop computers, phones, and tablets. Very few apps provided any security-type features: nine (12%) included e-mail verification, three (4%) required text verification, one (1%) provided social media verification, and one (1%) required a password to access the app. Sixty-two apps (79%) did not contain any security elements including no requirements for a login or a password.

Information about data privacy was also limited and variable. Thirteen apps (16%) provided a medical disclaimer for use of their apps. Only ten apps (12%) stated that they were HIPAA compliant, and three (4%) were Child Online Privacy and Protection Act (COPPA)-compliant. Fifty-one apps (63%) did not have or mention any privacy elements.

Geographic distribution

For each app, data on the number of downloads were abstracted for five countries with the highest numbers of downloads over the previous 30 days. This feature was only available on the Google Play store for 22 apps. A total of 33 countries are represented in the map in Fig. 2. Chatbot apps were downloaded globally, including in several African and Asian countries with more limited smartphone penetration. The United States had the highest number of total downloads (~1.9 million downloads, 12 apps), followed by India (~1.4 million downloads, 13 apps) and the Philippines (~1.25 million downloads, 4 apps). Details on the number of downloads and app across the 33 countries are available in Appendix 2.

Fig. 2: Geographic Distribution of Total Google Play Store Chatbot App Downloads, by Country.
figure 2

Source: UIA World Country Boundaries [2021]. Belgiu M., UNIGIS International Association, ArcGIS Hub.

NLP characteristics

Table 2 presents an overview of the characterizations of the apps’ NLP systems. Identifying and characterizing elements of NLP is challenging, as apps do not explicitly state their machine learning approach. We were able to determine the dialogue management system and the dialogue interaction method of the healthbot for 92% of apps. Dialogue management is the high-level design of how the healthbot will maintain the entire conversation while the dialogue interaction method is the way in which the user interacts with the system. While these choices are often tied together, e.g., finite-state and fixed input, we do see examples of finite-state dialogue management with the semantic parser interaction method. Ninety-six percent of apps employed a finite-state conversational design, indicating that users are taken through a flow of predetermined steps then provided with a response. One app was frame-based, which is better able to process user input even if it does not occur in a linear sequence (e.g., reinitiating a topic from further back in the conversation), and two were agent-based, which allows for more free-form and complex conversations. The majority (83%) had a fixed-input dialogue interaction method, indicating that the healthbot led the conversation flow. This was typically done by providing “button-push” options for user-indicated responses. Four apps utilized AI generation, indicating that the user could write two to three sentences to the healthbot and receive a potentially relevant response. Two apps (3%) utilized a basic parser, and one used a semantics parser (1%).

Table 2 Characterization of NLP system design.

We were able to identify the input and output modalities for 98% of apps. Input modality, or how the user interacts with the chatbot, was primarily text-based (96%), with seven apps (9%) allowing for spoken/verbal input, and three (4%) allowing for visual input. Visual input consisted of mood and food trackers that utilized emojis or GIFs. For the output modality, or how the chatbot interacts with the user, all accessible apps had a text-based interface (98%), with five apps (6%) also allowing spoken/verbal output, and six apps (8%) supporting visual output. Visual output, in this case, included the use of an embodied avatar with modified expressions in response to user input. Eighty-two percent of apps had a specific task for the user to focus on (i.e., entering symptoms).

Discussion

We identified 78 healthbot apps commercially available on the Google Play and Apple iOS stores. Healthbot apps are being used across 33 countries, including some locations with more limited penetration of smartphones and 3G connectivity. The healthbots serve a range of functions including the provision of health education, assessment of symptoms, and assistance with tasks such as scheduling. Currently, most bots available on app stores are patient-facing and focus on the areas of primary care and mental health. Only six (8%) of apps included in the review had a theoretical/therapeutic underpinning for their approach. Two-thirds of the apps contained features to personalize the app content to each user based on data collected from them. Seventy-nine percent apps did not have any of the security features assessed and only 10 apps reported HIPAA compliance.

The proposed framework that facilitates an assessment of the elements of the NLP system design of the healthbots is a novel contribution of this work. Most healthbots rely on rule-based approaches and finite-state dialogue management. They mainly rely on directing the user through a predetermined path to provide a response. Most included apps were fixed-input, i.e., the healthbot primarily led the conversation through written input and output modalities. Another scoping review of mental healthbots also yielded similar findings, noting that most included healthbots were rule-based with a system-based dialogue initiative, and primarily written input and output modalities33. Assessing these aspects of the NLP system sheds light on the level of automation of the bots (i.e., the amount of communication that is driven by the bot based on learning from the user-specific data, versus the conversation that has been hard-coded as an algorithm). This has direct implications on the value of, and the risks associated with the use of, the healthbot apps. Healthbots that use NLP for automation can be user-led, respond to user input, build a better rapport with the user, and facilitate more engaged and effective person-centered care9,41. Conversely, when healthbots are driven by the NLP engine, they might also pose unique risks to the user46,47, especially in cases where they are expected to serve a function based on knowledge about the user, where an empathetic response might be needed. A 2019 study found that a decision-making algorithm used in healthcare was found to be racially biased, affecting millions of Black people in the United States48. An understanding of the NLP system design can help advance the knowledge on the safety measures needed for the clinical/public health use and recommendation of such apps.

Despite limitations in access to smartphones and 3G connectivity, our review highlights the growing use of chatbot apps in low- and middle-income countries. In such contexts, chatbots may fill a critical gap in access to health services. Whereas in high-income countries, healthbots may largely be a supplement to face-to-face clinical care, in contexts where there is a shortage of healthcare providers, they have a more critical function in triaging individuals presenting with symptoms and referring them to care, if necessary, thereby, reducing the burden on traditional healthcare services. Additionally, such bots also play an important role in providing counselling and social support to individuals who might suffer from conditions that may be stigmatized or have a shortage of skilled healthcare providers. Many of the apps reviewed were focused on mental health, as was seen in other reviews of health chatbots9,27,30,33.

Several areas for further development of such chatbot apps have been identified through this review. First, data privacy and security continue to be a significant and prevalent concern—especially when sharing potentially sensitive health-related data. A small percentage of apps included in our review noted any form of data privacy or security, namely via identity verification and use of HIPAA or COPAA. With the sensitive nature of the data, it is important for these health-related chatbot apps to be transparent about how they are ensuring the confidentiality and safety of the data that is being shared with them49. A review of mHealth apps recommended nine items to support data privacy and security, which included ensuring the patient has control over the data provided, password authentication to access the app, and privacy policy disclosures50. Second, most healthbots have written input. Development and testing of such chatbot apps with larger input options such as spoken, visual, will facilitate improvements in access and utility. Third, several chatbot apps claim the use of artificial intelligence and machine learning but provide no further details. Our assessment of the NLP system design was limited given the scarce reporting on these aspects. As apps increasingly use ML, for utility in the healthcare context, they will need to be systematically assessed to ensure the safety of target users. This raises a need for clearer reporting on aspects of the ML techniques incorporated. As healthbots evolve to use newer methods to improve usability, satisfaction, and engagement, so do new risks associated with the automation of the chat interface. Healthbot app creators should report on what type of safety and bias protection mechanisms are employed to mitigate potential harm to their users, explain potential harms and risks of using the healthbot app to the users, and regularly monitor and track these mechanisms. These risks should be included in an industry-standard such as the ISO/TS 2523851. It is also important for the system to be transparent regarding the recommendations and informational responses, and how they are generated52. The databases and algorithms that are used to program healthbots are not removed from bias, which can cause further harm to users if not accounted for. The framework presented in this paper can guide systematic assessments and documents of features of healthbots.

To our knowledge, our study is the first comprehensive review of healthbots that are commercially available on the Apple iOS store and Google Play stores. Laranjo et al. conducted a systematic review of 17 peer-reviewed articles9. This review highlighted promising results regarding the acceptability of healthbots for health, as well as the preference for finite-state (wherein users are guided through a series of predetermined steps in the chatbot interaction) and frame-based (wherein the chatbot asks user questions to determine the direction of the interaction) dialogue management systems9. Another review conducted by Montenegro et al. developed a taxonomy of healthbots related to health32. Both of these reviews focused on healthbots that were available in scientific literature only and did not include commercially available apps. Our study leverages and further develops the evaluative criteria developed by Laranjo et al. and Montenegro et al. to assess commercially available health apps9,32. Similar to our findings, existing reviews of healthbots reported the paucity of standardization metrics to evaluate such chatbots, which limits the ability to rigorously understand the effectiveness, user satisfaction and engagement, risks and harm caused by the chatbot, and potential for use30.

The findings of this review should be seen in the light of some limitations. First, we used IAB categories, classification parameters utilized by 42Matters; this relied on the correct classification of apps by 42Matters and might have resulted in the potential exclusion of relevant apps. Additionally, the use of healthbots in healthcare is a nascent field, and there is a limited amount of literature to compare our results. Furthermore, we were unable to extract data regarding the number of app downloads for the Apple iOS store, only the number of ratings. This resulted in the drawback of not being able to fully understand the geographic distribution of healthbots across both stores. These data are not intended to quantify the penetration of healthbots globally, but are presented to highlight the broad global reach of such interventions. Only 10% of the apps were screened by two reviewers. Another limitation stems from the fact that in-app purchases were not assessed; therefore, this review highlights features and functionality only of apps that are free to use. Lastly, our review is limited by the limitations in reporting on aspects of security, privacy and exact utilization of ML. While our research team assessed the NLP system design for each app by downloading and engaging with the bots, it is possible that certain aspects of the NLP system design were misclassified.

Our review suggests that healthbots, while potentially transformative in centering care around the user, are in a nascent state of development and require further research on development, automation, and adoption for a population-level health impact.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.