Abstract
This perspective article explores the challenges and potential of using speech as a biomarker in clinical settings, particularly when constrained by the small clinical datasets typically available in such contexts. We contend that by integrating insights from speech science and clinical research, we can reduce sample complexity in clinical speech AI models, with the potential to decrease timelines to translation. Most existing models are based on high-dimensional feature representations trained with limited sample sizes and often do not leverage insights from speech science and clinical research. This approach can lead to overfitting, where the models perform exceptionally well on training data but fail to generalize to new, unseen data. Additionally, without incorporating theoretical knowledge, these models may lack interpretability and robustness, making them challenging to troubleshoot or improve post-deployment. We propose a framework for organizing health conditions based on their impact on speech and promote the use of speech analytics in diverse clinical contexts beyond cross-sectional classification. For high-stakes clinical use cases, we advocate for a focus on explainable and individually-validated measures and stress the importance of rigorous validation frameworks and ethical considerations for responsible deployment. Bridging the gap between AI research and clinical speech research presents new opportunities for more efficient translation of speech-based AI tools and for advancing scientific discovery in this interdisciplinary space, particularly when development is limited to small or retrospective datasets.
Introduction
Recently, there has been a surge in interest in leveraging the acoustic properties (how it sounds) and linguistic content (what is said) of human speech as biomarkers for various health conditions. The underlying premise is that disturbances in neurological, mental, or physical health, which affect the speech production mechanism, can be discerned through alterations in speech patterns. As a result, there is a growing emphasis on developing AI models that use speech for the diagnosis, prognosis, and monitoring of conditions such as mental health disorders1,2,3,4,5, cognitive disorders6,7,8,9,10, and motor diseases11,12,13,14,15, among others.
The development of clinical speech AI has predominantly followed a supervised learning paradigm, building on the success of data-driven approaches for consumer speech applications16,17. For instance, analysis of published speech-based models for dementia reveals that most models rely on high-dimensional speech and language representations18, either explicitly extracted or obtained from acoustic foundation models19,20 and language foundation models21,22, to predict diagnostic labels9,23,24,25; a similar trend is observed for depression5,26. These foundation models, initially pre-trained on data from general populations, are subsequently fine-tuned using clinical data to improve predictive accuracy for specific conditions. While data-driven classification models based on deep learning have worked well for data-rich applications like automatic speech recognition (ASR), the challenges in high-stakes clinical speech technology are distinctly different due to a lack of data availability at scale. For example, in the ASR literature, speech corpora can amount to hundreds of thousands of hours of speech samples and corresponding transcripts upon which models can be robustly trained in a supervised fashion16,17. In contrast, currently available clinical datasets are much smaller, with the largest samples in published systematic reviews9,24,25 consisting of only tens to hundreds of minutes of speech or a few thousand words. This is because clinical data collection is inherently more challenging than in other speech-based applications. Clinical populations are more diverse and present with variable symptoms that must be documented alongside the speech samples to ensure proper sampling from relevant strata.
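To make the prevailing paradigm concrete, the sketch below shows one common pattern: extracting embeddings from a pre-trained acoustic foundation model and fitting a small classifier on diagnostic labels. This is a minimal illustration rather than any published pipeline; the file list and labels are hypothetical placeholders, and the closing comment flags the sample-size problem discussed above.

```python
# Minimal sketch of the common embed-then-classify paradigm (illustrative only).
import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

def embed(wav_path):
    """Mean-pool the model's hidden states into one fixed-length vector per recording."""
    audio, sr = librosa.load(wav_path, sr=16000)
    inputs = extractor(audio, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # shape: (1, frames, 768)
    return hidden.mean(dim=1).squeeze().numpy()      # 768-dimensional embedding

# wav_files and labels are hypothetical: a small clinical set of recordings and diagnoses.
X = [embed(f) for f in wav_files]
# With tens of recordings and hundreds of embedding dimensions, cross-validated accuracy
# estimates are high-variance and prone to the overoptimism discussed in this article.
print(cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5))
```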
Compounding the data problem is the fact that the ground truth accuracy of diagnostic labels for different conditions where speech is impacted varies from 100% certainty to less than 50% certainty, particularly in the early stages of disease when mild symptoms are nonspecific and present similarly across many different diseases27,28,29,30,31,32,33,34. Retrospective datasets often used to train published models do not always report diagnostic label accuracy or the criteria used to arrive at a diagnosis. Collecting representative, longitudinal speech corpora with paired consensus diagnoses is time-intensive and further impedes the development of large-scale corpora, which are required for developing diagnostic models based on supervised learning. Unfortunately, supervised models built on smaller-scale corpora often exhibit over-optimistic performance in controlled environments35 and fail to generalize in out-of-sample deployments36,37. This raises the question of how we can successfully harness the power of AI to advance clinical practice and population health in the context of data availability constraints.
Here we propose that the clinical data constraints provide an opportunity for co-design of new analytics pipelines with lower sample complexity in collaboration with the clinical speech science community. The clinical speech science community has long studied the correlational and causal links between various health conditions and speech characteristics38,39,40,41,42. This research has focused on the physiological, neurological, and psychological aspects of speech production and perception, primarily through acoustic analysis of the speech signal and linguistic analysis of spoken language. These analyses involve interpretable and conceptually meaningful attributes of speech, often measured perceptually43, via functional rating scales15, or via self-reported questionnaires44. Contributions from speech scientists, neuroscientists, and clinical researchers have deepened our understanding of human speech production mechanisms and their neural underpinnings, and particularly how neurodegeneration manifests as characteristic patterns of speech decline across clinical conditions43,45.
A co-design of a new explainable analytics pipeline can intentionally integrate scientific insights from speech science and clinical research into existing supervised models. We hypothesize that this will reduce timelines to translation, therefore providing an opportunity to grow clinical data scale through in-clinic use. As data size grows, data-driven methods with greater analytic flexibility can be used to discover new relations between speech and different clinical conditions and to develop more nuanced analytical models that can be confidently deployed for high-stakes clinical applications.
Bridging the gap between speech AI and clinical speech research leads to new opportunities in both fields. There is a clear benefit to the development of more sensitive tools for the assessment of speech for the clinical speech community. Existing instruments for the assessment of speech exhibit notable within-rater and between-rater variability46. Developing objective proxies for these clinically-relevant constructs has the potential for increased sensitivity and reduced variability. More sensitive objective measures can also catalyze scientific discovery, enabling the identification of yet-to-be-discovered speech patterns across different clinical conditions. Conversely, effectively connecting speech AI research with clinical research enables AI developers to prioritize challenges directly aligned with clinical needs and streamline model building by leveraging domain-specific knowledge to mitigate the need for large datasets. To date, model developers have often overlooked feasibility constraints imposed by the inherent complexity of the relationship between speech production and the condition of interest. For example, recent efforts in clinical speech AI have focused on the cross-sectional classification of depression from short speech samples5,26. Given the well-documented variability in speech production47, the limitations of existing instruments for detecting depression40, and the heterogeneity in the manifestation of depression symptoms48, it is unlikely that stand-alone speech-based models will yield high diagnostic accuracy. Other studies have proposed using speech to predict conditions like coronary artery disease49 or diabetes50. However, to the best of our knowledge, there is no substantial literature supporting the hypothesis that speech changes are specific enough to these conditions to serve as stand-alone indicators. In working with small data sets, understanding the approximate limits of prediction is critical for resource allocation and avoiding unwarranted conclusions that could lead to premature model deployment.
This perspective article advocates for a stronger link between the speech AI community and clinical speech community for the development of scientifically-grounded explainable models in clinical speech analytics. We begin by presenting a new framework for organizing clinical conditions based on their impact on the speech production mechanism (see Fig. 1). We believe such a framework is important to facilitate a shared understanding of the impact of clinical conditions on speech and stimulate interdisciplinary thought and discussion. It is useful in categorizing health conditions by the complexity and uncertainty they present for speech-based clinical AI models and provides a mental model for considering the inherent limitations of speech-based classification across different conditions. It orients researchers to consider the challenges posed by limited clinical datasets during model development, and helps prevent frequent methodological errors. This has the potential to expedite progress and further foster collaboration between the speech AI community and the clinical speech community. We then explore various contexts of use for speech analytics beyond cross-sectional classification, highlighting their clinical value and the value they provide to the clinical speech research community (see Fig. 2). The discussion further examines how the selected context of use influences model development and validation, advocating for the use of lower-dimensional, individually-validated and explainable measures with potential to reduce sample size requirements (see Fig. 3). The paper concludes with a discussion on ethical, privacy, and security considerations, emphasizing the importance of rigorous validation frameworks and responsible deployment (see Fig. 4).
The clinically-relevant information in speech
The production of spoken language is a complex, multi-stage process that involves precise integration of language, memory, cognition, and sensorimotor functions. Here we use the term ‘speech production’ to refer broadly to the culmination of these spoken language processes. There are several extant speech production models, each developed to accomplish different goals (see, for example51,52,53,54,55). Common to these models is that speech begins with a person conceptualizing an idea to be communicated, formulating the language that will convey that idea, specifying the sensorimotor patterns that will actualize the language, and then speaking56:
- Conceptualization: the speaker forms an abstract idea that they want to verbalize (Abstract idea formulation) and the intention to share through speech (Intent to speak).
- Formulation: the speaker selects the words that best convey their idea and sequences them in an order allowed by the language (Linguistic formulation). Then they plan the sequence of phonemes and the prosodic pattern of the speech to be produced (Morphological encoding). Next, they program a sequence of neuromuscular commands to move speech structures (Phonetic encoding).
- Articulation: the speaker produces words via synergistic movement of the speech production system. Respiratory muscles produce a column of air that drives the vocal folds (Phonation) to produce sound. This sound is shaped by the Articulator movements to produce speech. Two feedback loops (Acoustic feedback and Proprioceptive feedback) refine the neuromuscular commands produced during the Phonetic encoding stage over time.
Figure 1 introduces a hierarchy, or ordering, of health conditions based on how direct their impact is on the speech production mechanism. This hierarchy, motivated by initial work on speech and stress57, roughly aligns with the three stages of speech production and has direct consequences for building robust clinical speech models based on supervised learning.
This hierarchy compels researchers to ask and answer three critical questions prior to engaging in AI model development for a particular health condition. First, how directly and specifically does the health condition impact speech and/or language? In general, the further upstream the impact of a health condition on speech, the more indeterminate and nuanced the manifestations become, making it challenging to build supervised classification models on diagnostic labels. As we move from lower to higher-order health conditions, there are more mediating variables between the health condition and the observed speech changes, making the relationship between the two more variable and complex.
The second question the hierarchy compels researchers to ask and answer is: what are the sensitivity and specificity of the ground truth labels for the health condition? In general (with notable exceptions), the objective accuracy of ground truth labels for the presence or absence of a health condition becomes less certain from lower to higher-order conditions, adding noise and uncertainty to any supervised classification models built upon the labels. High specificity of ground truth labels is critical for the development of models that distinguish between health conditions with overlapping speech and language symptoms. The answers to these two questions provide critical context for predicting the utility of an eventual model prior to model building.
Finally, the hierarchy asks model developers to identify the clinically relevant speech symptoms to include in the model. In Table 1, we provide a more complete definition of each level in the hierarchy, a list of example conditions associated with each level, and the primary speech symptoms associated with each condition. The list is not exhaustive and does not consider second- and third-order impacts on speech. For example, Huntington's disease (HD) has a first-order impact on speech, causing hyperkinetic dysarthria (see Table 1). But it also has second- and third-order impacts to the extent that one experiences cognitive issues and personality changes with the disease. Nevertheless, the table serves as a starting point for developing theoretically-grounded models. Directly modeling the subset of primary speech symptoms known to be impacted by the condition of interest may help reduce sample size requirements and result in smaller models that are more likely to generalize.
Ordering of health conditions based on speech impact
Zeroth-order conditions have direct, tangible effects on the speech production mechanism (including the structures of respiration, phonation, articulation, and resonance) that manifest in the acoustic signal, impacting the Articulation stage in our model in Fig. 1. The impact of the physical condition on the acoustic signal can be understood using physical models of the vocal tract and vocal folds58 that allow for precise characterization of the relationship between the health condition and the acoustics. As an example, benign vocal fold masses increase the mass of the epithelial cover of the vocal folds, thereby altering the stiffness ratio between the epithelial cover and the muscular body. The impact on vocal fold vibration and the resulting acoustic signal are amenable to modeling. These types of conditions are physically verifiable upon laryngoscopy, providing consistent ground truth labeling of the condition, and the relationship between the condition, its impact on the physical apparatus, and the voice acoustics is direct and quantifiable (although differential diagnosis of vocal fold mass subtype is more difficult; see refs. 59,60). Thus, zeroth-order health conditions directly impact the speech apparatus anatomy and often have verifiable ground-truth labels.
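The direct, quantifiable chain from anatomy to acoustics can be illustrated with a deliberately simplified calculation. The sketch below treats the vocal fold as a one-mass spring system, a crude stand-in for the richer finite-element and multi-mass models cited above; the parameter values are illustrative, not measured, and serve only to show how added cover mass shifts the predicted fundamental frequency.

```python
# Simplified one-mass vocal fold approximation (illustrative values, not a validated model).
import numpy as np

def f0_one_mass(mass_g, stiffness_n_per_m):
    """Natural frequency (Hz) of a mass-spring approximation of one vocal fold."""
    mass_kg = mass_g / 1000.0
    return np.sqrt(stiffness_n_per_m / mass_kg) / (2 * np.pi)

baseline = f0_one_mass(mass_g=0.10, stiffness_n_per_m=80)      # ~142 Hz with these values
with_lesion = f0_one_mass(mass_g=0.13, stiffness_n_per_m=80)   # added cover mass lowers F0
print(f"baseline F0 ~ {baseline:.0f} Hz; with added mass ~ {with_lesion:.0f} Hz")
```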
First-order conditions interfere with the transduction of neuromuscular commands into movement of the articulators (e.g., dysarthria secondary to a motor disorder). As with zeroth-order conditions, first-order conditions also disturb the physical speech apparatus and the Articulation stage in our model; however, the cause is indirect. Injury or damage to the cortical and subcortical neural circuits and nerves impacts sensorimotor control of the speech structures by causing weakness, improper muscle tone, and/or mis-scaling and incoordination of speech movements61. The sensorimotor control of speech movements is mediated through five neural pathways and circuits, each associated with a set of cardinal and overlapping speech symptoms: the upper and lower motor neuron pathways, the direct and indirect basal ganglia circuits, and the cerebellar circuit. Damage to these areas causes distinct changes in speech:
- The lower motor neurons (cranial and spinal nerves, originating in the brainstem and spinal cord, respectively) directly innervate the speech musculature. Damage to lower motor neurons results in flaccid paralysis and reduced or absent reflexes in the muscles innervated by the damaged nerves, and a flaccid dysarthria when cranial nerves are involved.
- The upper motor neurons originate in the motor cortex and are responsible for initiating and inhibiting activation of the lower motor neurons. Damage to upper motor neurons supplying speech musculature results in spastic paralysis and hyperreflexia, and a spastic dysarthria.
- The basal ganglia circuit is responsible for facilitating and scaling motor programs and for inhibiting involuntary movements. Damage to the direct basal ganglia circuit causes too little movement (hypokinesia, as in Parkinson's disease), resulting in a hypokinetic dysarthria; damage to the indirect basal ganglia circuit causes too much movement (hyperkinesia, as in Huntington's disease), resulting in a hyperkinetic dysarthria.
- The cerebellar circuit is responsible for fine-tuning movements during execution. Damage to the cerebellar circuit results in incoordination and an ataxic dysarthria.
Speech symptoms are characteristic when damage occurs to any (or several) of these neural pathways, although there is symptom overlap, and symptoms evolve in presence and severity as the disease progresses61. The diagnostic accuracy and test-retest reliability (within and between raters) of dysarthria labels assigned from the speech signal alone (i.e., without knowledge of the underlying health condition) are known to be modest, except for expert speech-language pathologists with large and varied neurology caseloads62. Diagnosis of the corresponding health conditions relies on a physician's clinical assessment and consideration of other confirmatory information beyond speech. Diagnostic accuracy is affected by the physician's experience and expertise, by whether the symptoms presenting in the condition are textbook or unusual, and by whether genetic, imaging, or other laboratory tests provide supporting or confirmatory evidence. For example, unilateral vocal fold paralysis is a first-order health condition with a direct impact on the speech apparatus (impaired vocal fold vibration) and high ground-truth accuracy and specificity (it can be visualized by laryngoscopy). In contrast, Parkinson's disease (PD) has a diffuse impact on the speech apparatus (affecting phonation, articulation, and prosody) that is hard to distinguish from healthy speech or from other similar health conditions (e.g., progressive supranuclear palsy) in early disease. The reported ground-truth accuracy of the initial clinical diagnosis ranges from 58% to 80%, calling into question clinical labels in early-stage PD28.
Second-order conditions move away from the speech production mechanism's structure and function and into the cognitive (i.e., memory and language) and perceptual processing domains. These conditions impact the Formulation stage of speaking and manifest as problems finding and sequencing the words to convey one's intended message, and may include deficits in speech comprehension. Alzheimer's disease (AD) is a second-order condition that deserves particular attention because of the burgeoning efforts in the literature to develop robust supervised classification models63. AD disrupts the Formulation stage of speaking with word-finding problems and a tendency to use simpler and more general semantic and syntactic structures. Natural language processing (NLP) techniques have been used to characterize these patterns, and acoustic analysis has identified speech slowing with greater pausing while speaking, presumably because of decreased efficiency of cognitive processing and early sensorimotor changes9,24,25.
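As a concrete, hedged example of the interpretable timing measures studied in this literature, the sketch below derives a pause proportion and a speech rate from a simple energy-based speech/silence segmentation. The threshold, the recording name, and the word count are illustrative placeholders, not a validated measurement model.

```python
# Sketch of pause and rate measures from a picture-description recording (illustrative).
import librosa
import numpy as np

audio, sr = librosa.load("picture_description.wav", sr=16000)   # hypothetical recording
frame, hop = 1024, 256
rms = librosa.feature.rms(y=audio, frame_length=frame, hop_length=hop)[0]
voiced = rms > 0.05 * rms.max()            # crude energy threshold for speech frames

frame_dur = hop / sr
total_dur = len(audio) / sr
pause_proportion = 1.0 - voiced.sum() * frame_dur / total_dur

# Speech rate additionally requires a word or syllable count, e.g. from a manual
# transcript or an ASR hypothesis; n_words here is a placeholder.
n_words = 87
words_per_minute = n_words / (total_dur / 60)
print(f"pause proportion = {pause_proportion:.2f}, rate = {words_per_minute:.0f} wpm")
```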
While the clinical study of speech and language in AD has consistently found evidence of such pattern changes in individuals diagnosed with probable AD, progress toward developing generalizable speech-based supervised learning clinical models for mild cognitive impairment (MCI) and AD has been relatively slow despite optimistic performance results reported in the literature35,63. We posit that this can be explained by the answers to the first two questions that the framework in Fig. 1 compels researchers to consider. First, there is a lack of specificity of early speech and language symptoms to MCI and AD, given that the output is mediated by several intermediate stages and by the variability associated with speech production. Mild and nonspecific speech and language symptoms will continue to pose a challenge for the development of clinical early detection/diagnostic speech tools until sufficient training data enable the identification of distinct signatures (if they exist). Furthermore, given the current difficulty in accurately diagnosing MCI and AD, models based on supervised learning may be unwittingly using mislabeled training and testing samples. At present, AD is a clinical diagnosis, often preceded by a period under another clinical diagnosis of MCI. MCI is extremely difficult to diagnose with certainty, owing to variability in symptoms and their presentation over time, the overlap of speech and language symptoms with other etiologies, and the diagnostic reliance on self-report33. With the current absence of a definitive ground truth label for MCI or early Alzheimer's disease, and the lack of specificity in speech changes, supervised learning models trained on small, questionably labeled datasets will likely continue to struggle to generalize to new data.
Third-order conditions impact the Conceptualization stage of speech production and include mental health conditions affecting mood and thought. These conditions can manifest in significant deficits and differences in speech and language, and this has been well characterized in the literature4. For example, acoustic analysis can reveal the rapid, pressed speech associated with mania, as well as the slowed speech without prosodic variation that might accompany depression. Natural language processing can reveal and quantify disjointed and incoherent thought in the context of psychiatric disorders64. Despite this, the impact of these mood and thought conditions on the speech apparatus and language centers in the brain may be indirect and nonspecific relative to lower-order conditions. Mental health conditions frequently cause a mixture or fluctuation of positive symptoms (e.g., hallucinations, mania) and negative symptoms (e.g., despondence, depression), which can present chronically, acutely, or intermittently. The associated speech and language patterns can be attributed to any number of other causes (fatigue, anxiety, etc.). With regard to ground-truth accuracy and specificity, studies have shown that around half of schizophrenia diagnoses are inaccurate65. This problem has resulted in a push to identify objective biomarkers to distinguish schizophrenia from anxiety and other mood disorders66,67. This complicates the development of models for health condition detection and diagnosis; however, machine-learning models may be developed to objectively measure speech and language symptoms associated with specific symptomatology. For example, distinguishing between negative and positive disease symptoms may be achievable with careful construction of speech elicitation tasks and normative reference data, given the central role that language plays in the definition of these symptoms68,69.
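One widely used NLP approach to quantifying disjointed discourse is first-order semantic coherence: the similarity between embeddings of adjacent sentences in a transcript. The sketch below is a minimal version of that idea; the embedding model and the three-sentence transcript are illustrative assumptions, and any clinical use would require the validation steps discussed later in this article.

```python
# Sketch of first-order semantic coherence over a hypothetical transcript (illustrative).
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

sentences = [
    "I went to the store this morning.",
    "They were out of the bread I usually buy.",
    "The moon landing was filmed in my garden.",
]  # hypothetical transcript, already split into sentences

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(sentences)

# Similarity between each sentence and the one before it; low values suggest
# tangential or disjointed discourse.
coherence = [float(cosine_similarity([emb[i]], [emb[i + 1]])[0, 0])
             for i in range(len(emb) - 1)]
print(np.mean(coherence), coherence)
```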
Across all health conditions, extraneous and comorbid factors can exert meaningful influence on speech production. For example, anxiety, depression, and fatigue, perhaps even as consequences of an underlying illness, are known to impact the speech signal. It is not straightforward to distinguish their influence from that of the factors of primary interest, adding complexity and uncertainty for models based on supervised learning, regardless of the health condition's order. However, the increased variability in both data and diagnostic accuracy for many higher-order conditions makes speech-based models trained using supervised learning on small datasets especially vulnerable to reduced sensitivity and specificity. This is not merely a matter of augmenting the dimensionality of speech features or enlarging the dataset; it reflects the intrinsic variability in how humans generate speech. Finally, the accuracy and specificity of ground truth labels for health conditions are critical to consider in assessing the feasibility of interpretable model development. Unlike the relatively stable link between speech and the health condition, the accuracy of these labels is expected to improve over time as diagnostic technologies advance and criteria evolve, thereby potentially enabling more robust model development.
Defining an appropriate context of use
As mentioned before, most published clinical speech AI development studies are based on supervised learning where developers build AI models to distinguish between two classes or to predict disease severity. This approach generally presumes the same context of use for clinical speech analytics across different applications: namely, the cross-sectional detection of a specific condition or a prediction of clinical severity based on a speech sample. As we established in the foregoing discussion, this approach, when combined with limited training data, is less likely to generalize.
Nevertheless, there are a number of use cases in which speech analytics and AI can provide more immediate value and expedite model translation. These are outlined in Fig. 2 and explored in greater depth below. Focusing on these use cases will reduce timelines to translation, providing an opportunity to grow clinical data scale through in-clinic collection. With increased data size and diversity, researchers will be better able to characterize the currently unknown fundamental limits of prediction for speech-based classification models for higher-order conditions (e.g., how well can we classify between depressed and non-depressed speech), and can bring to bear more advanced data-driven methods on problems that provide clinical value.
Diagnostic assistance
Despite rapid advancements in biomedical diagnostics, the majority of neurodegenerative diseases are diagnosed by the presence of cardinal symptoms on clinical exams. As discussed previously and as shown in Table 1, many health conditions include changes in speech as a core symptom. For example, diagnosis of psychiatric conditions involves analysis of speech and language attributes, such as coherence, fluency, and tangentiality70. Likewise, many neurodegenerative diseases lead to dysarthria, and a confirmatory speech deficit pattern can be used to support their diagnoses61. Tools for the assessment of these speech deficit patterns in the clinical setting typically depend on clinician judgment or on patient-reported scales. There is a large body of evidence indicating that these methods exhibit variable reliability, both between different raters and within the same rater over time46,62. Clinical speech analytics has the potential to enhance diagnostic accuracy by providing objective measures of clinical speech characteristics that contribute to diagnosis, such as hypernasality, impaired vocal quality, and articulation deficits in dysarthria, or measures of coherence and tangentiality in psychosis. These objective measures can support manual diagnosis in the clinic or can be used as input into multi-modal diagnostic systems based on machine learning.
Non-specific risk assessment tools
While differential diagnosis based on speech alone is likely not possible for many conditions, progressive and unremitting changes in certain aspects of speech within an individual can be a sign of an underlying illness or disorder61. Clinical speech analytics can be used to develop tools that track changes in speech along specific dimensions known to be vulnerable to degradation in different conditions. This could provide value as an early-warning indicator, particularly as the US health system moves toward home-based care and remote patient monitoring. Such a tool could trigger additional tests when key speech changes reach some threshold or are supported by changes in other monitored modalities.
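A minimal sketch of such a flag, under the assumption that an individual's own earlier sessions serve as the reference, is shown below: each new remote measurement of a speech dimension is compared against the personal baseline and follow-up is triggered when the deviation crosses a preset threshold. The values and the threshold are illustrative; in practice a persistent pattern across sessions, not a single measurement, would be required.

```python
# Sketch of a within-individual, non-specific risk flag (illustrative values).
import numpy as np

baseline_wpm = np.array([172, 168, 175, 170, 169, 173])   # individual's baseline sessions
new_wpm = 149.0                                            # latest home recording

z = (new_wpm - baseline_wpm.mean()) / baseline_wpm.std(ddof=1)
if z < -2.5:   # threshold is illustrative; persistent drops would be required in practice
    print(f"Speaking rate {z:.1f} SD below personal baseline; flag for clinical follow-up.")
```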
Longitudinal tracking post-diagnosis
In many conditions, important symptoms can be tracked via speech post-diagnosis. For example, tracking bulbar symptom severity in ALS, as a proxy for general disease progression, can provide insight into when augmentative and alternative communication (AAC) devices should be considered or inform end-of-life planning71. In Parkinson's disease, longitudinal tracking of speech symptoms would be beneficial for drug titration72,73. In dementia, longitudinal tracking of symptoms measurable via speech (e.g., memory and cognitive-linguistic function) can provide valuable information regarding appropriate care and when changes need to be made.
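A simple way to summarize such tracking for an individual is the estimated rate of change in a speech measure across repeated remote sessions. The sketch below fits a linear trend to hypothetical speaking-rate data; real analyses would more often use mixed-effects models across patients, and the numbers here are illustrative only.

```python
# Sketch of per-patient trend estimation from repeated remote sessions (illustrative data).
import numpy as np

days = np.array([0, 14, 30, 44, 60, 75, 91])
speaking_rate = np.array([168, 165, 161, 158, 152, 149, 144])   # words per minute

slope_per_day, intercept = np.polyfit(days, speaking_rate, deg=1)
print(f"Estimated decline: {slope_per_day * 30:.1f} words/min per month")
```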
Speech as a clinically meaningful endpoint
Speech is our principal means of communication and social interaction. Conditions that impair speech can severely hinder a patient's communicative abilities, thereby diminishing their overall quality of life. Current methods for assessing communication outcomes include perceptual evaluations, such as listening and rating, or self-reported questionnaires61,69. In contrast to its use as a stand-alone diagnostic tool, employing clinical speech analytics to objectively assess communicative abilities is inherently viable across many conditions because of the direct correspondence between the construct (communicative ability) and the input (speech). For instance, in dysarthria, clinical speech analytics may be utilized to estimate intelligibility, the percentage of words understood by listeners, which significantly affects communicative participation74. In psychosis, speech analytics can facilitate the creation of objective tools for assessing social competencies, which are closely tied to quality of life indicators69. Similarly, in dementia, a decline in social interaction can lead to isolation and depression, perhaps hastening cognitive decline75. A related emerging use case in Alzheimer's disease is providing context for blood-based diagnostics. As new biomarkers with confirmatory evidence of pathophysiology emerge, there will likely be an increase in Alzheimer's diagnoses without co-occurring clinical-behavioral features. Patients with AD diagnoses but without symptoms will require context around this diagnosis, and speech analytics will be important as measures of behavioral change related to quality of life.
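One commonly explored intelligibility proxy, sketched below under stated assumptions, scores an ASR hypothesis of a known reading passage against the target text and reports word-level accuracy. This stands in for listener transcription accuracy only to the extent the ASR system behaves like a naive listener, which itself requires the analytical validation discussed later; the passage and hypothesis here are illustrative.

```python
# Sketch of an ASR-based intelligibility proxy (illustrative target and hypothesis).
import jiwer

target = "the quick brown fox jumps over the lazy dog"
asr_hypothesis = "the quick brown fox jump over the lazy"   # hypothetical ASR output

wer = jiwer.wer(target, asr_hypothesis)
intelligibility_proxy = max(0.0, 1.0 - wer) * 100
print(f"Estimated intelligibility: {intelligibility_proxy:.0f}% of words recovered")
```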
Improving clinical trial design
The Food and Drug Administration (FDA) prioritizes patient-relevant measures as endpoints in clinical trials and has identified speech and communication metrics as particularly underdeveloped for orphan diseases76. Objective and clinically-meaningful measures based on speech analytics that can be collected more frequently can improve sensitivity for detecting intervention effects. Such measures have the potential to decrease the required sample sizes for drug trials, enable more efficient enrollment, and ascertain efficacy with greater efficiency77.
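The sample-size argument can be made explicit with a standard power calculation: averaging repeated at-home measurements reduces the within-arm standard deviation, which increases the standardized effect size and lowers the number of participants needed per arm. The sketch below is illustrative only, and assumes independent measurement noise across sessions, an assumption that real repeated speech measures only partially satisfy.

```python
# Sketch: how lower measurement variability shrinks required trial size (illustrative numbers).
from statsmodels.stats.power import TTestIndPower
import numpy as np

treatment_effect = 2.0          # hypothetical slowing of decline, in units of the measure
clinic_sd = 6.0                 # single in-clinic assessment
home_sd = 6.0 / np.sqrt(8)      # averaging ~8 remote sessions (assumes independent noise)

power = TTestIndPower()
for label, sd in [("clinic visits", clinic_sd), ("frequent home sampling", home_sd)]:
    n = power.solve_power(effect_size=treatment_effect / sd, power=0.8, alpha=0.05)
    print(f"{label}: ~{int(np.ceil(n))} participants per arm")
```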
Facilitating development of digital therapeutics
There has been significant recent interest in development of digital therapeutics for various neurological and mental health conditions. Several of these devices target improving the patients’ social skills or communication abilities78. In this evolving space, introducing concrete digital markers of social competence allows for more efficient evaluation of efficacy and precision approaches for customizing therapeutics for the patient.
Development and validation of robust models
The context of use profoundly influences the development of clinical speech AI models, shaping their design, validation, and implementation strategies. For example, for contexts of use involving home monitoring, robustness to background noise and variability in recording conditions, as well as usability, are essential. For longitudinal monitoring, developed tools must be sensitive to subtle changes in speech characteristics relevant to the progression of the condition being monitored. This necessitates longitudinal data collection for development and validation to ensure stability and sensitivity over time. Screening tools intended for diverse populations require a training dataset that captures demographic variability to avoid bias. Solutions based on noisy diagnostic labels may require uncertainty modeling through Bayesian machine learning or ensemble methods that quantify prediction confidence79. Concurrently, techniques like label smoothing80 and robust loss functions81 can enhance model resilience under label noise.
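As one hedged illustration of ensemble-based confidence under noisy labels, the sketch below trains several models on bootstrap resamples and treats the spread of their predicted probabilities as a per-sample uncertainty estimate that can accompany, or gate, the prediction. It is a generic pattern, not tied to any particular study.

```python
# Sketch of bootstrap-ensemble prediction uncertainty (generic, illustrative pattern).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

def ensemble_predict(X_train, y_train, X_new, n_models=25, seed=0):
    rng = np.random.RandomState(seed)
    probs = []
    for _ in range(n_models):
        Xb, yb = resample(X_train, y_train, random_state=rng)          # bootstrap resample
        clf = LogisticRegression(max_iter=1000).fit(Xb, yb)
        probs.append(clf.predict_proba(X_new)[:, 1])
    probs = np.array(probs)
    # Mean is the prediction; the standard deviation across the ensemble is the uncertainty.
    return probs.mean(axis=0), probs.std(axis=0)

# A mean near 0.5 or a large standard deviation both argue for deferring to clinical judgment.
```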
Each context of use presents a custom development path to address its unique challenges and a parallel validation strategy that spans hardware verification, analytical validation, and clinical validation (see Fig. 3). The current approach focused on data-driven supervised learning on diagnostic labels limits the development and understanding of new models and makes model validation challenging. While there are many validation metrics for evaluating AI model performance, the prevalent metrics in published speech-based models primarily focus on estimating "model accuracy" (e.g., what percent of the time the model correctly classifies between Healthy and Dementia labels based on speech) using a number of methods (e.g., cross-validation, held-out test accuracy). However, accurately estimating the accuracy of high-dimensional supervised learning models is challenging, and current methods are prone to overoptimism35. In addition, many supervised machine learning models are sensitive to input perturbations, which is a significant concern for speech features known for their day-to-day variability82. Consequently, model performance often diminishes with temporal variation in the data.
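A small, self-contained illustration of the overoptimism problem is shown below: with many features and few samples, selecting features on the full dataset before cross-validation leaks information and inflates accuracy even when the data contain no signal at all. The dimensions are chosen to resemble a small clinical speech study; the example demonstrates the leakage mechanism rather than reproducing any published result.

```python
# Sketch: feature-selection leakage inflates cross-validated accuracy on pure noise.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.randn(60, 1000)          # 60 speakers, 1000 acoustic features, no real signal
y = rng.randint(0, 2, 60)

# Leaky: feature selection sees all folds before cross-validation.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_sel, y, cv=5).mean()

# Correct: selection refit inside each training fold only.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy ~ {leaky:.2f}, honest CV accuracy ~ {honest:.2f} (chance is 0.50)")
```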
A starting point for clinical model validation is the Verification/Analytical Validation/Clinical Validation (V3) framework, a framework for validating digital biometric monitoring technologies. The original version of the framework proposes a structured approach with three evaluation levels: verification of hardware, analytical validation, and clinical validation83. This framework has roots in principles of verification and validation for software quality management and deployment84. While these existing validation systems are designed to confirm that the end system accurately measures what it purports to measure, the V3 framework adds the additional step of confirming that the clinical tools are meaningful to a defined clinical population. To that end, verification ascertains the sensor data's fidelity within its intended environment; analytical validation examines the accuracy of algorithms processing sensor data to yield behavioral or physiological metrics; and clinical validation evaluates clinical model outputs against clinical ground truths or established measures known to be meaningful to patients, including existing clinical scales like the PHQ-9 (depression) or the UPDRS (Parkinson's disease). In Fig. 3 we provide a high-level overview of the end-to-end development and validation process for clinical speech AI. It is important to note that V3 is a conceptual framework that must be specifically instantiated for the validation of different clinical speech applications. While it can help guide the development of a validation plan, it does not provide one out of the box. Furthermore, this level of validation is only a starting point, as the FDA suggests constant model monitoring post-deployment to ensure continued generalization85.
Supervised learning approaches based on uninterpretable input features and clinical diagnostic labels make adoption of the complete V3 framework challenging. Analytical validation is especially challenging because it is difficult to ensure that learned speech representations are measuring or detecting the physiological behaviors of interest. For example, in Parkinson's disease, both the speaking rate and the rate of opening and closing of the vocal folds are impacted. Uninterpretable features have unknown relationships with these behavioral and physiological parameters. As an alternative, model developers can use representations that are analytically validated relative to these constructs, leading to more interpretable clinical models. Validation should be approached end-to-end during the development process, with different stages (and purposes of analysis) employing different validation methods. Small-scale pilot tests may focus on parts of this framework. However, for work with deployment as a goal, ensuring generalizability and clinical utility requires validating the hardware on which the speech was collected, ensuring that intermediate representations are valid indicators of behavioral and physiological measures (e.g., speaking rate, articulatory precision, language coherence), and ensuring that clinical models developed using these speech measures are associated with existing clinical ground truths or scales that are meaningful to patients86.
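Analytical validation of one such intermediate measure can be as simple as comparing the automated output against manual annotation on held-out recordings and reporting agreement, as in the sketch below. The per-recording values are placeholders; a full validation would also examine agreement across recording conditions, severity levels, and demographic strata.

```python
# Sketch of analytical validation for an automated speaking-rate measure (placeholder values).
import numpy as np
from scipy.stats import pearsonr

manual_wpm = np.array([151, 168, 142, 175, 160, 133, 158])      # annotator reference
automated_wpm = np.array([148, 171, 140, 170, 163, 130, 161])   # algorithm output

r, p = pearsonr(manual_wpm, automated_wpm)
mae = np.mean(np.abs(manual_wpm - automated_wpm))
bias = np.mean(automated_wpm - manual_wpm)
print(f"r = {r:.2f}, mean absolute error = {mae:.1f} wpm, bias = {bias:+.1f} wpm")
```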
Interpretable, clinically-important measures based on speech are currently missing from the literature. Clinically-relevant feature discovery and model performance evaluation in speech analytics are challenged by the high-dimensionality of speech, complex patterns, and limited datasets. Table 1 highlights several speech constructs that have been studied relative to various conditions; however, most of these constructs do not have standardized operational definitions in the clinical speech analytics literature. Instead, model developers rely on high-dimensional representations that have been developed for other purposes. For example, adopted from the ASR literature, many clinical models use representations based on mel-frequency cepstral coefficients or mel-spectra18; or representations learned by pre-trained foundation models19,20. However, these features are not interpretable, making analytical and clinical validation challenging.
Development of a clinically-tailored speech representation could significantly refine the development process, favoring smaller, individually validated, and clinically-grounded features that allow scientists to make contact with the existing literature and that mitigate model overfitting and variability. This field would benefit from a concerted and synergistic effort by the speech AI community and the speech science community to operationalize and validate measurement models for intermediate constructs like those listed in Table 187. For example, in our previous work, we made progress in this direction by developing measurement models for the assessment of hypernasality and consonant-vowel transitions and using them to evaluate cleft lip and palate and dysarthria88,89; several measures of volition and coherence for schizophrenia69; and measures of semantic relevance for dementia10. Individually-validated interpretable measures allow for easier alignment to different contexts of use and integration within larger multi-modal systems, and they establish a more direct link to the existing clinical literature. Furthermore, they can be used to explain the operation of larger, more complex models via bottleneck constraints90, or they can be combined with new methods in causal machine learning for the development of explainable models91.
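The sketch below illustrates the alternative argued for here: a handful of individually validated, clinically meaningful measures feeding a small model whose standardized coefficients can be read against the clinical literature. The feature names echo constructs from Table 1, and the data file and label column are hypothetical placeholders.

```python
# Sketch of a low-dimensional, interpretable clinical model (hypothetical data file).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features = ["speaking_rate", "pause_proportion", "articulatory_precision",
            "semantic_relevance", "lexical_diversity"]
df = pd.read_csv("validated_speech_measures.csv")    # hypothetical: one row per participant

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(df[features], df["label"])

# Standardized weights are directly inspectable against the clinical literature.
coefs = pipe.named_steps["logisticregression"].coef_[0]
for name, c in sorted(zip(features, coefs), key=lambda t: -abs(t[1])):
    print(f"{name:>24s}: {c:+.2f}")
```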
Finally, clinically-interpretable representations can also play a pivotal role in integrating the patient’s perspective into the design of algorithms. The idea is that by aligning closely with the lived experiences and symptoms important to patients, these representations ensure that algorithmic outcomes resonate with the quality of life impact of health conditions. The hypothesis is that this patient-centric approach could have the added benefit of reinforcing patient trust and engagement in digital health.
Ethical, privacy, and security considerations
The deployment and regulation of clinical speech models in healthcare present multiple challenges and risks. Prematurely launched models (without robust validation) risk delivering clinically inaccurate results and potentially causing patient harm, while biases in model training can lead to skewed performance across diverse populations. Moreover, the use of speech data for health analytics raises significant privacy and security concerns. We outline these considerations in Fig. 4 and expand on them below.
Premature deployment of inaccurate models
A primary risk of prematurely-deployed models is that they will provide clinically inaccurate output. As discussed in previous work35, current strategies to validate AI models are insufficient and produce overoptimistic estimates of accuracy. Several studies have highlighted this as a more general problem in AI-based science92,93. However, reported accuracy metrics carry much weight when presented to the public and can lead to premature deployment. There is considerable risk that these models will fail if deployed and will potentially harm patients94. For example, consider the Cigna StressWaves Test, deployed after only internal evaluation and without public efficacy data. This model analyzes a user's voice to predict their stress level and is publicly available on the Cigna website. Independent testing of the model reveals that it has poor test-retest reliability (measured via the intraclass correlation) and poor agreement with existing instruments for measuring stress37.
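The kind of independent check reported in that evaluation, test-retest reliability via the intraclass correlation on repeated sessions from the same speakers, can be sketched as below. The data layout and values are illustrative; the point is that reliability can be audited externally with a handful of repeated measurements.

```python
# Sketch of a test-retest reliability check via intraclass correlation (illustrative data).
import pandas as pd
import pingouin as pg

long = pd.DataFrame({
    "speaker": ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"],
    "session": [1, 2, 1, 2, 1, 2, 1, 2],
    "stress_score": [62, 31, 45, 70, 55, 52, 28, 61],   # tool output at two time points
})

icc = pg.intraclass_corr(data=long, targets="speaker", raters="session",
                         ratings="stress_score")
print(icc[["Type", "ICC", "CI95%"]])   # ICC well below ~0.75 signals poor reliability
```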
Biased models
An additional risk of clinical speech-based models stems from the homogeneity of the data often used to train them. Biological and socio-cultural differences contribute significantly to variation in both the speech signal and the clinical conditions (impacting aspects from risk factors to treatment efficacy). Careful consideration of these differences in model building necessitates robust experiment design and representative stratification of data. However, a recent study demonstrates that published clinical AI models are heavily biased demographically, with 71% of the training data coming from only three states (California, Massachusetts, and New York) and 34 states not represented at all95. Similarly, analysis of clinical speech datasets indicates a significant skew towards the English language, overlooking the linguistic diversity of global populations. To accurately capture health-related speech variations, it is essential to broaden data collection efforts to include a more representative range of the world's native languages, as health-related changes in speech can be native-language-specific96. It is challenging to determine how models trained on unrepresentative data would perform when deployed for demographic groups on which they were not trained.
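One mitigation step, sketched below, is to report performance stratified by demographic subgroup rather than a single pooled accuracy, so that gaps are visible before deployment. The column names and the predictions file are hypothetical placeholders.

```python
# Sketch of subgroup-stratified evaluation on a held-out set (hypothetical file and columns).
import pandas as pd
from sklearn.metrics import recall_score

eval_df = pd.read_csv("held_out_predictions.csv")   # columns: group, y_true, y_pred (0/1)

for group, sub in eval_df.groupby("group"):
    sens = recall_score(sub["y_true"], sub["y_pred"])               # sensitivity
    spec = recall_score(sub["y_true"], sub["y_pred"], pos_label=0)  # specificity
    print(f"{group}: sensitivity={sens:.2f}, specificity={spec:.2f}, n={len(sub)}")
```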
Privacy and security considerations
Speech and language data are widely available and, as we continue to interact with our mobile devices, we generate an ever-growing personal footprint of our health status. Previous studies have shown that such data (speeches, social media posts, interviews) can be analyzed for health analytics97,98,99. There is a risk that similar data, on an even larger scale and over longer periods of time, could be accessed by technology companies to make claims about the health or emotional state of their users without their permission, or by national or international adversaries to advance a potentially false narrative about the health of key figures. The privacy risks of this type of analysis, if used outside of academic research, are considerable, with national and international political ramifications: domestically, political adversaries could advance a potentially false narrative about the health of candidates; internationally, geopolitical adversaries could exploit this as an additional dimension of influence in elections.
There is no silver bullet for reducing these risks; however, several steps can be taken as mitigation strategies. With the public availability of speech technology, building AI models has become commoditized; the bottleneck remains prospective validation. Thorough validation of the model based on well-accepted frameworks such as V3 is crucial prior to deployment83. This validation must extend beyond initial datasets and include diverse demographic groups to mitigate biases. Moreover, developers should engage in continuous post-deployment monitoring to identify and rectify any deviations in model performance or emergent biases. Transparency in methodology and results, coupled with responsible communication to the public, can reduce the risks of misperceived model accuracy.
On the privacy front, there are emerging technical solutions to parts of this problem based on differential privacy and federated learning100,101,102; however, a complete socio-technical solution will require stringent data protection regulations and ethical guidelines to safeguard personal health information. First, it is wise to reconsider IRB review protocols in light of new technologies and publicly available data; in industry, proactive collaboration with regulatory bodies (e.g., the FDA) can help establish clear guidelines. This path is clear for companies focused on clinical solutions; however, the regulation of AI-based devices from technology companies, particularly those focused on wellness, is less well-defined. Recent guidance from the Federal Trade Commission (FTC) advising companies to make only evidence-backed claims about AI-driven products is a step in the right direction103.
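The core idea behind federated learning can be conveyed with a heavily simplified sketch: each site fits a model on its own data and only the parameters, never the speech recordings, are shared and averaged. Production systems layer on secure aggregation and often differential-privacy noise on the updates; none of that is shown here, and the averaging scheme below is a naive illustration rather than a recommended protocol.

```python
# Heavily simplified federated-averaging sketch (illustrative; no secure aggregation or DP noise).
import numpy as np
from sklearn.linear_model import SGDClassifier

def local_update(X_site, y_site, coef, intercept):
    """One round of local training that starts from the current global model."""
    clf = SGDClassifier(loss="log_loss", max_iter=5, tol=None, random_state=0)
    clf.fit(X_site, y_site, coef_init=coef, intercept_init=intercept)
    return clf.coef_, clf.intercept_

def federated_round(sites, coef, intercept):
    """Average the parameter updates from all participating sites."""
    updates = [local_update(X, y, coef, intercept) for X, y in sites]
    new_coef = np.mean([c for c, _ in updates], axis=0)
    new_intercept = np.mean([b for _, b in updates], axis=0)
    return new_coef, new_intercept

# Usage sketch: start from zeros and run a few rounds over per-site (X, y) datasets.
# coef, intercept = np.zeros((1, n_features)), np.zeros(1)
# for _ in range(10):
#     coef, intercept = federated_round(site_datasets, coef, intercept)
```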
Data availability
There is no data associated with this manuscript as it is a perspectives article centered around a theoretical framework.
References
Niu, M., Romana, A., Jaiswal, M., McInnis, M. & Mower Provost, E. Capturing mismatch between textual and acoustic emotion expressions for mood identification in bipolar disorder. In Proc. INTERSPEECH 2023, 1718–1722 (2023).
Koops, S. et al. Speech as a biomarker for depression. CNS Neurol. Disord. Drug Targets 22, 152–160 (2023).
Wu, P. et al. Automatic depression recognition by intelligent speech signal processing: A systematic survey. CAAI Trans. Intell. Technol. 8, 701–711 (2023).
Low, D. M., Bentley, K. H. & Ghosh, S. S. Automated assessment of psychiatric disorders using speech: A systematic review. Laryngoscope Investig. Otolaryngol. 5, 96–116 (2020).
Cummins, N. et al. A review of depression and suicide risk assessment using speech analysis. Speech Commun. 71, 10–49 (2015).
Braun, F. et al. Classifying dementia in the presence of depression: A cross-corpus study. In INTERSPEECH 2023 (ISCA, 2023).
Zolnoori, M., Zolnour, A. & Topaz, M. Adscreen: A speech processing-based screening system for automatic identification of patients with alzheimer’s disease and related dementia. Artif. Intell. Med. 143, 102624 (2023).
Agbavor, F. & Liang, H. Predicting dementia from spontaneous speech using large language models. PLOS Digit. Health 1, e0000168 (2022).
Martínez-Nicolás, I., Llorente, T. E., Martínez-Sánchez, F. & Meilán, J. J. G. Ten years of research on automatic voice and speech analysis of people with Alzheimer's disease and mild cognitive impairment: a systematic review article. Front. Psychol. 12, 620251 (2021).
Stegmann, G. et al. Automated semantic relevance as an indicator of cognitive decline: Out-of-sample validation on a large-scale longitudinal dataset. Alzheimer's Dement. Diagn. Assess. Dis. Monit. 14, e12294 (2022).
Ríos-Urrego, C. D., Rusz, J., Nöth, E. & Orozco-Arroyave, J. R. Automatic classification of hypokinetic and hyperkinetic dysarthria based on GMM-Supervectors. In INTERSPEECH 2023 (ISCA, 2023).
Reddy, M. K. & Alku, P. Exemplar-based sparse representations for detection of Parkinson’s disease from speech. IEEE/ACM Trans. Audio, Speech Lang. Process. 31, 1386–1396 (2023).
Khaskhoussy, R. & Ayed, Y. B. Improving Parkinson’s disease recognition through voice analysis using deep learning. Pattern Recognit. Lett. 168, 64–70 (2023).
Moro-Velazquez, L., Gomez-Garcia, J. A., Arias-Londoño, J. D., Dehak, N. & Godino-Llorente, J. I. Advances in Parkinson’s disease detection and assessment using voice and speech: A review of the articulatory and phonatory aspects. Biomed. Signal Process. Control 66, 102418 (2021).
Stegmann, G. M. et al. Early detection and tracking of bulbar changes in ALS via frequent and remote speech analysis. NPJ Digital Med. 3, 132 (2020).
Radford, A. et al. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, 28492–28518 (PMLR, 2023).
Rao, K., Sak, H. & Prabhavalkar, R. Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 193–199 (IEEE, 2017).
Stegmann, G. M. et al. Repeatability of commonly used speech and language features for clinical applications. Digit. Biomark. 4, 109–122 (2020).
Baevski, A., Zhou, Y., Mohamed, A. & Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 33, 12449–12460 (2020).
Babu, A. et al. XLS-R Self-supervised cross-lingual speech representation learning at scale. In Proc. Interspeech 2278–2282 (2022).
Achiam, J. et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT (2019).
Wang, C., Liu, S., Li, A. & Liu, J. Text dialogue analysis for primary screening of mild cognitive impairment: Development and validation study. J. Med. Internet Res. 25, e51501 (2023).
de la Fuente Garcia, S., Ritchie, C. W. & Luz, S. Artificial intelligence, speech, and language processing approaches to monitoring Alzheimer's disease: a systematic review. J. Alzheimer's Dis. 78, 1547–1574 (2020).
Petti, U., Baker, S. & Korhonen, A. A systematic literature review of automatic Alzheimer’s disease detection from speech and language. J. Am. Med. Inform. Assoc. 27, 1784–1797 (2020).
Flanagan, O., Chan, A., Roop, P. & Sundram, F. Using acoustic speech patterns from smartphones to investigate mood disorders: scoping review. JMIR mHealth uHealth 9, e24352 (2021).
Boushra, M. & McDowell, C. Stroke-Like Conditions (StatPearls Publishing, Treasure Island (FL), 2023). http://europepmc.org/books/NBK541044.
Beach, T. G. & Adler, C. H. Importance of low diagnostic accuracy for early Parkinson’s disease. Mov. Disord. 33, 1551–1554 (2018).
Richards, D., Morren, J. A. & Pioro, E. P. Time to diagnosis and factors affecting diagnostic delay in amyotrophic lateral sclerosis. J. Neurol. Sci. 417, 117054 (2020).
Schulz, J. B. et al. Diagnosis and treatment of Friedreich Ataxia: a european perspective. Nat. Rev. Neurol. 5, 222–234 (2009).
Dang, J. et al. Progressive apraxia of speech: Delays to diagnosis and rates of alternative diagnoses. J. Neurol. 268, 4752–4758 (2021).
Escott-Price, V. et al. Genetic analysis suggests high misassignment rates in clinical Alzheimer's cases and controls. Neurobiol. aging 77, 178–182 (2019).
Edmonds, E. C., Delano-Wood, L., Galasko, D. R., Salmon, D. P. & Bondi, M. W. Subjective cognitive complaints contribute to misdiagnosis of mild cognitive impairment. J. Int. Neuropsychol. Soc. 20, 836–847 (2014).
Rokham, H., Falakshahi, H., Fu, Z., Pearlson, G. & Calhoun, V. D. Evaluation of boundaries between mood and psychosis disorder using dynamic functional network connectivity (dfnc) via deep learning classification. Hum. Brain Mapp. 44, 3180–3195 (2023).
Berisha, V., Krantsevich, C., Stegmann, G., Hahn, S. & Liss, J. Are reported accuracies in the clinical speech machine learning literature overoptimistic? In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2022, 2453–2457 (2022).
Chekroud, A. M. et al. Illusory generalizability of clinical prediction models. Science 383, 164–167 (2024).
Yawer, B., Liss, J. & Berisha, V. Reliability and validity of a widely-available AI tool for assessment of stress based on speech. Sci. Rep. (2023).
Behrman, A. Speech and voice science (Plural publishing, 2021).
Hixon, T. J., Weismer, G. & Hoit, J. D. Preclinical speech science: Anatomy, physiology, acoustics, and perception (Plural Publishing, 2018).
LaPointe, L. L. Paul Broca and the origins of language in the brain (Plural Publishing, 2012).
Raphael, L. J., Borden, G. J. & Harris, K. S. Speech science primer: Physiology, acoustics, and perception of speech (Lippincott Williams & Wilkins, 2007).
Ferrand, C. T. Speech science: An integrated approach to theory and clinical practice. Ear Hearing 22, 549 (2001).
Duffy, J. R. In Motor Speech Disorders-E-Book: Substrates, differential diagnosis, and management (Elsevier Health Sciences, 2012).
Baylor, C. et al. The communicative participation item bank (CPIB): Item bank calibration and development of a disorder-generic short form. J. Speech Lang. Hear. Res. 56, 1190–1208 (2013).
Boschi, V. et al. Connected speech in neurodegenerative language disorders: a review. Front. Psychol. 8, 208495 (2017).
Bunton, K., Kent, R. D., Duffy, J. R., Rosenbek, J. C. & Kent, J. F. Listener agreement for auditory-perceptual ratings of dysarthria. J. Speech Lang. Hear. Res. 50, 1481–1495 (2007).
Perkell, J. S. & Klatt, D. H. Invariance and variability in speech processes (Psychology Press, 2014).
Fried, E. Moving forward: how depression heterogeneity hinders progress in treatment and research. Expert Rev. Neurother. 17, 423–425 (2017).
Sara, J. D. S. et al. Noninvasive voice biomarker is associated with incident coronary artery disease events at follow-up. In Mayo Clinic Proceedings, vol. 97, 835–846 (Elsevier, 2022).
Kaufman, J. M., Thommandram, A. & Fossat, Y. Acoustic analysis and prediction of type 2 diabetes mellitus using smartphone-recorded voice segments. Mayo Clin. Proc.: Digit. Health 1, 534–544 (2023).
Bohland, J. W., Bullock, D. & Guenther, F. H. Neural representations and mechanisms for the performance of simple speech sequences. J. Cogn. Neurosci. 22, 1504–1529 (2010).
Goldrick, M. & Cole, J. Advancement of phonetics in the 21st century: Exemplar models of speech production. J. Phon. 99, 101254 (2023).
Houde, J. F. & Nagarajan, S. S. Speech production as state feedback control. Front. Hum. Neurosci. 5, 82 (2011).
Story, B. H. & Bunton, K. A model of speech production based on the acoustic relativity of the vocal tract. J. Acoust. Soc. Am. 146, 2522–2528 (2019).
Walker, G. M. & Hickok, G. Evaluating quantitative and conceptual models of speech production: how does slam fare? Psychon. Bull. Rev. 23, 653–660 (2016).
Levelt, W. J. Models of word production. Trends Cogn. Sci. 3, 223–232 (1999).
Steeneken, H. J. & Hansen, J. H. Speech under stress conditions: overview of the effect on speech production and on system performance. In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No. 99CH36258), vol. 4, 2079–2082 (IEEE, 1999).
Alipour, F., Berry, D. A. & Titze, I. R. A finite-element model of vocal-fold vibration. J. Acoust. Soc. Am. 108, 3003–3012 (2000).
Marshall, C., Lyons, T., Al Omari, A., Alnouri, G. & Sataloff, R. T. Misdiagnosis of vocal fold nodules. J. Voice (2023).
Compton, E. C. et al. Developing an artificial intelligence tool to predict vocal cord pathology in primary care settings. Laryngoscope 133, 1952–1960 (2023).
Duffy, J. R. Motor speech disorders: Clues to neurologic diagnosis. In Parkinson’s disease and movement disorders: Diagnosis and treatment guidelines for the practicing physician, 35–53 (Springer, 2000).
Pernon, M., Assal, F., Kodrasi, I. & Laganaro, M. Perceptual classification of motor speech disorders: the role of severity, speech task, and listener’s expertise. J. Speech Lang. Hear. Res. 65, 2727–2747 (2022).
Parsapoor, M. AI-based assessments of speech and language impairments in dementia. Alzheimer’s. Dement. 19, 4675–4687 (2023).
Voleti, R., Liss, J. M. & Berisha, V. A review of automated speech and language features for assessment of cognitive and thought disorders. IEEE J. Sel. Top. Signal Process. 14, 282–298 (2019).
Kvig, E. I. & Nilssen, S. Does method matter? assessing the validity and clinical utility of structured diagnostic interviews among a clinical sample of first-admitted patients with psychosis: A replication study. Front. Psychiatry 14, 1076299 (2023).
Cachán-Vega, C. SOD and CAT as potential preliminary biomarkers for the differential diagnosis of schizophrenia and bipolar disorder in the first episode of psychosis. Eur. Psychiatry 66, S449–S450 (2023).
Gao, Y. et al. Decreased resting-state neural signal in the left angular gyrus as a potential neuroimaging biomarker of schizophrenia: an amplitude of low-frequency fluctuation and support vector machine analysis. Front. Psychiatry 13, 949512 (2022).
Kuperberg, G. Language in schizophrenia Part 1: an introduction. Lang. Linguist. Compass 4, 576–589 (2010).
Voleti, R. et al. Language analytics for assessment of mental health status and functional competency. Schizophr. Bull. 49, S183–S195 (2023).
Cohen, A. S., McGovern, J. E., Dinzeo, T. J. & Covington, M. A. Speech deficits in serious mental illness: a cognitive resource issue? Schizophr. Res. 160, 173–179 (2014).
Stegmann, G. et al. A speech-based prognostic model for dysarthria progression in ALS. Amyotroph. Lateral Scler. Frontotemporal Degener. 1–6 (2023).
Farzanehfar, P., Woodrow, H. & Horne, M. Sensor measurements can characterize fluctuations and wearing off in Parkinson’s disease and guide therapy to improve motor, non-motor and quality of life scores. Front. Aging Neurosci. 14, 852992 (2022).
Schulz, G. M. The effects of speech therapy and pharmacological treatments on voice and speech in Parkinson’s disease: A review of the literature. Curr. Med. Chem. 9, 1359–1366 (2002).
Borrie, S. A., Wynn, C. J., Berisha, V. & Barrett, T. S. From speech acoustics to communicative participation in dysarthria: Toward a causal framework. J. Speech Lang. Hear. Res. 65, 405–418 (2022).
Shen, L.-X. et al. Social isolation, social interaction, and Alzheimer’s disease: a Mendelian randomization study. J. Alzheimer’s Dis. 80, 665–672 (2021).
Department of Health and Human Services. Development of Standard Core Clinical Outcomes Assessments (COAs) and Endpoints (UG3/UH3 Clinical Trial Optional). https://grants.nih.gov/grants/guide/rfa-files/RFA-FD-21-004.html. Funding Opportunity Announcement (FOA) Number: RFA-FD-21-004 (2020).
Rutkove, S. B. et al. Improved ALS clinical trials through frequent at-home self-assessment: a proof of concept study. Ann. Clin. Transl. Neurol. 7, 1148–1157 (2020).
Jacobson, N. C., Kowatsch, T. & Marsch, L. A. Digital therapeutics for mental health and addiction: The state of the science and vision for the future (Academic Press, 2022).
Karimi, D., Dou, H., Warfield, S. K. & Gholipour, A. Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis. Med. Image Anal. 65, 101759 (2020).
Li, W., Dasarathy, G. & Berisha, V. Regularization via structural label smoothing. In International Conference on Artificial Intelligence and Statistics, 1453–1463 (PMLR, 2020).
Ma, X. et al. Normalized loss functions for deep learning with noisy labels. In International Conference on Machine Learning, 6543–6553 (PMLR, 2020).
Zhang, L. & Qi, G.-J. WCP: Worst-case perturbations for semi-supervised deep learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3912–3921 (2020).
Goldsack, J. C. et al. Verification, analytical validation, and clinical validation (V3): the foundation of determining fit-for-purpose for biometric monitoring technologies (BioMeTs). npj Digit. Med. 3, 55 (2020).
Food and Drug Administration, Center for Devices and Radiological Health & Center for Biologics Evaluation and Research. General Principles of Software Validation; Final Guidance for Industry and FDA Staff (Food and Drug Administration, 2002).
Berisha, V. et al. Digital medicine and the curse of dimensionality. npj Digit. Med. 4, 153 (2021).
Robin, J. et al. Evaluation of speech-based digital biomarkers: review and recommendations. Digit. Biomark. 4, 99–108 (2020).
Liss, J. & Berisha, V. Operationalizing clinical speech analytics: Moving from features to measures for real-world clinical impact. J. Speech Lang. Hear. Res. 1–7 (2024).
Mathad, V. C., Scherer, N., Chapman, K., Liss, J. M. & Berisha, V. A deep learning algorithm for objective assessment of hypernasality in children with cleft palate. IEEE Trans. Biomed. Eng. 68, 2986–2996 (2021).
Mathad, V. C., Liss, J. M., Chapman, K., Scherer, N. & Berisha, V. Consonant-vowel transition models based on deep learning for objective evaluation of articulation. IEEE/ACM Trans. Audio, Speech, Lang. Process. 31, 86–95 (2022).
Xu, L., Liss, J. & Berisha, V. Dysarthria detection based on a deep learning model with a clinically-interpretable layer. JASA Express Lett. 3 (2023).
Moraffah, R., Karami, M., Guo, R., Raglin, A. & Liu, H. Causal interpretability for machine learning-problems, methods and evaluation. ACM SIGKDD Explor. Newsl. 22, 18–33 (2020).
Gibney, E. Is AI fuelling a reproducibility crisis in science? Nature 608, 250–251 (2022).
Kapoor, S. & Narayanan, A. Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4, 100804 (2023).
Slavich, G. M., Taylor, S. & Picard, R. W. Stress measurement using speech: Recent advancements, validation issues, and ethical and privacy considerations. Stress 22, 408–413 (2019).
Kaushal, A., Altman, R. & Langlotz, C. Geographic distribution of US cohorts used to train deep learning algorithms. JAMA 324, 1212–1213 (2020).
García, A. M., de Leon, J., Tee, B. L., Blasi, D. E. & Gorno-Tempini, M. L. Speech and language markers of neurodegeneration: a call for global equity. Brain 146, 4870–4879 (2023).
Berisha, V., Wang, S., LaCross, A. & Liss, J. Tracking discourse complexity preceding Alzheimer’s disease diagnosis: a case study comparing the press conferences of Presidents Ronald Reagan and George Herbert Walker Bush. J. Alzheimer’s Dis. 45, 959–963 (2015).
Berisha, V. et al. Float like a butterfly sting like a bee: Changes in speech preceded parkinsonism diagnosis for Muhammad Ali. In INTERSPEECH, 1809–1813 (2017).
Seabrook, E. M., Kern, M. L., Fulcher, B. D. & Rickard, N. S. Predicting depression from language-based emotion dynamics: longitudinal analysis of Facebook and Twitter status updates. J. Med. Internet Res. 20, e168 (2018).
BN, S., Rajtmajer, S. & Abdullah, S. Differential Privacy enabled Dementia Classification: An exploration of the privacy-accuracy trade-off in speech signal data. In Proc. INTERSPEECH 2023, 346–350 (2023).
Saifuzzaman, M., Ananna, T. N., Chowdhury, M. J. M., Ferdous, M. S. & Chowdhury, F. A systematic literature review on wearable health data publishing under differential privacy. Int. J. Inf. Secur. 21, 847–872 (2022).
Rieke, N. et al. The future of digital health with federated learning. npj Digit. Med. 3, 119 (2020).
Atleson, M. Keep your AI claims in check. https://www.ftc.gov/business-guidance/blog/2023/02/keep-your-ai-claims-check. Accessed: 2023-11-21 (2023).
Dunworth, K. et al. Using “real-world data” to study cleft lip/palate care: An exploration of speech outcomes from a multi-center US learning health network. Cleft Palate Craniofac. J. 10556656231207469 (2023).
Ji, C. et al. The application of three-dimensional ultrasound with reformatting technique in the diagnosis of fetal cleft lip/palate. J. Clin. Ultrasound 49, 307–314 (2021).
Andreassen, R. & Hadler-Olsen, E. Eating and speech problems in oral and pharyngeal cancer survivors–associations with treatment-related side-effects and time since diagnosis. Spec. Care Dent. 43, 561–571 (2023).
Chen, J. et al. Preoperative voice analysis and survival outcomes in papillary thyroid cancer with recurrent laryngeal nerve invasion. Front. Endocrinol. 13, 1041538 (2022).
Brockmann-Bauser, M. et al. Effects of vocal intensity and fundamental frequency on cepstral peak prominence in patients with voice disorders and vocally healthy controls. J. Voice 35, 411–417 (2021).
Mavrea, S. & Regan, J. Perceptual and acoustic evaluation of pitch elevation to predict aspiration status in adults with dysphagia of various aetiologies/beyond stroke. Dysphagia 33, 532–533 (2022).
Hurtado-Ruzza, R. et al. Self-perceived handicap associated with dysphonia and health-related quality of life of asthma and chronic obstructive pulmonary disease patients: A case–control study. J. Speech Lang. Hear. Res. 64, 433–443 (2021).
Folstein, S. E., Leigh, R. J., Parhad, I. M. & Folstein, M. F. The diagnosis of Huntington’s disease. Neurology 36, 1279 (1986).
Wilson, S. M. & Hula, W. D. Multivariate approaches to understanding aphasia and its neural substrates. Curr. Neurol. Neurosci. Rep. 19, 1–9 (2019).
Ross, E. D. Disorders of vocal emotional expression and comprehension: The aprosodias. Handb. Clin. Neurol. 183, 63–98 (2021).
Robin, J. et al. Automated detection of progressive speech changes in early Alzheimer’s disease. Alzheimer’s. Dement. 15, e12445 (2023).
Acknowledgements
This work is funded in part by the John and Tami Marick Family Foundation, NIH NIA grant 1R01AG082052-01, NIH NIDCD grants R01DC006859-11 and R21DC019475, and NIH NIDCR grant R21DE026252-01A.
Author information
Contributions
V.B. and J.M.L. both made substantial contributions to the conception or design of this work and helped draft and revise the manuscript.
Corresponding author
Correspondence to Visar Berisha (visar@asu.edu).
Ethics declarations
Competing interests
The authors declare no competing non-financial interests but the following competing financial interests: Berisha and Liss are founders of and previously held equity in Aural Analytics, a clinical speech analytics company.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Berisha, V., Liss, J.M. Responsible development of clinical speech AI: Bridging the gap between clinical research and technology. npj Digit. Med. 7, 208 (2024). https://doi.org/10.1038/s41746-024-01199-1