Identifying neurocognitive disorder using vector representation of free conversation

In recent years, studies on the use of natural language processing (NLP) approaches to identify dementia have been reported. Most of these studies used picture description tasks or other similar tasks to elicit spontaneous speech, but free conversation, which requires no task, might be easier to perform in a clinical setting. Moreover, free conversation is unlikely to induce a learning effect. Therefore, the purpose of this study was to develop a machine learning model to discriminate between subjects with and without dementia by extracting features from unstructured free conversation data using NLP. We recruited patients who visited a specialized outpatient clinic for dementia as well as healthy volunteers. Participants' conversations were transcribed, and the text data were decomposed from natural sentences into morphemes by morphological analysis using NLP and then converted into real-valued vectors that were used as features for machine learning. A total of 432 datasets were used, and the resulting machine learning model classified the data for dementia and non-dementia subjects with an accuracy of 0.900, a sensitivity of 0.881, and a specificity of 0.916. Using sentence vector information, it was possible to develop a machine learning algorithm capable of discriminating dementia from non-dementia subjects with high accuracy based on free conversation.

One study attempted to discriminate 9 AD patients from 9 controls by machine learning, using lexical and acoustic features extracted from recorded semi-structured interview data17. The maximum discrimination accuracy between AD and controls was 0.88 (sensitivity 0.83, specificity 0.90). Mirheidari et al. recorded the conversations of 15 patients with neurodegenerative disorders and 15 patients with functional memory disorder who visited a memory clinic while being examined by a neurologist18. They then attempted to discriminate between the two groups by machine learning, using lexical features, acoustic features, subjects' verbal responses to specific questions (e.g., who is most concerned about their symptoms?), and non-verbal information such as 'patient turns to other' as features. The results showed a maximum accuracy of 0.97, but the lexical features made the smallest contribution to the prediction. All of these studies were conducted on small numbers of cases. On the other hand, studies using the DementiaBank dataset were limited to subjects with AD; however, it is necessary to evaluate cognitive function regardless of the cause of dementia in order to apply such methods in real-world clinical practice. Therefore, the purpose of this study was to create a machine learning model that discriminates between dementia, including non-AD dementia, and non-dementia subjects by applying NLP to unstructured free conversation data.

Datasets.
A total of 590 datasets (n = 161) were collected as the total study sample, including dementia, mild cognitive impairment (MCI), and CHC participants. For this study, 432 (n = 135) of the 590 datasets were analyzed after excluding datasets collected from participants who were under 45 years old, data with a GDS score of 10 or higher, data with missing items, and data in which subjects spoke in a strong dialect (Fig. 1).

Prediction accuracy.
A threshold of 1, which yielded the highest prediction accuracy, was adopted for the number of votes in the prediction model. The results for discriminating between dementia and non-dementia using all the datasets showed an accuracy of 0.900, a sensitivity of 0.881, and a specificity of 0.916 (Table 2). The average AUC was 0.935 (Fig. 2). There was no statistically significant difference between the prediction accuracy for the data that met the criteria for use as training data and that of the overall data (χ2 = 0.402, p = 0.526). There was no statistically significant difference in prediction accuracy according to sex (χ2 = 0.015, p = 0.901), nor between subjects 75 years of age and older and those under 75 years (χ2 = 2.902, p = 0.088).

Number of letters and prediction accuracy.
The relationship between the number of letters in the text and the prediction accuracy is shown in Fig. 3. The prediction accuracy exceeded 0.8 at 600 letters and appeared to reach a plateau of 0.866 at 1300 letters, with a peak of 0.875 at 1800 letters. As shown in Table 3, the model with one vote using our original vector and a DNN showed the highest accuracy. The accuracy, sensitivity, and specificity obtained using other document embeddings and machine learning methods are shown in Table 4. The DNN applied to the data vectorized with the original algorithm had the highest prediction accuracy.

Discussion
In this study, we used a total of 432 datasets to build a machine learning model to detect cognitive decline corresponding to dementia, using text data transcribed from unstructured free conversation audio data. The final machine learning model classified dementia and non-dementia data with an AUC of 0.93, an accuracy of 0.900, a sensitivity of 0.881, and a specificity of 0.916, based on the largest amount of data ever reported in a study aimed at detecting dementia using NLP and machine learning based only on free conversation. Of the studies using the DementiaBank dataset, Alkenani et al. reported an accuracy of 0.95 and an AUC of 0.98 using only linguistic features20. They used lexicosyntactic and character n-gram spaces as features and combined the predictions of multiple classifiers in heterogeneous ensemble methods. We considered that accuracy could be improved by using a corpus of spoken language rather than a corpus of written language. On the other hand, since no Japanese corpus of spoken language was available, we decided to use a method that does not require a large training dataset, and we also decided to attempt to use a DNN for machine learning. It is important to note that, although our model did not match their AUC, our free conversation NLP, which does not use a task, was still accurate enough for screening. The most widely used screening tool for dementia in real-world clinical practice is the MMSE. The sensitivity and specificity of the MMSE are reported to be 0.81 and 0.89, respectively21, and the results of this study are similar or even better, indicating sufficient screening ability. In terms of the relationship between the number of letters and the prediction accuracy, we obtained an accuracy of 0.8 at 600 letters.
Since daily conversation in Japanese is considered to be between 360 and 420 letters per minute, this means that we should be able to obtain 600 letters from approximately 100 s of a subject's speech. Similarly, it is calculated that it takes 3-5 min for the correct prediction rate to reach its peak. Therefore, our machine learning model is fully applicable to real-world clinical practice.
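The arithmetic behind this estimate can be checked with a short calculation; the letters-per-minute range is taken from the text above, and the function name is illustrative:

```python
# How long it takes to collect a given number of letters of Japanese
# speech at a conversational rate of 360-420 letters per minute.
def seconds_to_collect(n_letters, letters_per_min):
    return n_letters / letters_per_min * 60

fast = seconds_to_collect(600, 420)  # ~86 s at the fast end
slow = seconds_to_collect(600, 360)  # 100 s at the slow end
```

At the slow end of the range, 600 letters indeed corresponds to about 100 s of speech.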
The domains of cognitive function affected by dementia are heterogeneous, and it seems difficult to capture the various forms of cognitive decline through linguistic aspects alone. However, neurodegenerative diseases are thought to damage the neurons that control cognition, speech, and language processes, and consequently affect the linguistic aspects of patients22. If we consider that various cognitive functions contribute to language function, then it is also possible that various cognitive impairments are reflected in language. In fact, linguistic changes in AD patients are known to occur early in the course of the disease23,24,25,26,27, and language function has been found to be correlated with overall cognitive function28. In addition, previous studies using NLP and machine learning have shown that AD can be identified with high accuracy10,11,12,13,14,15,16,17,18.
While recent studies using NLP and machine learning to detect dementia have used tasks such as the picture description task to encourage spontaneous speech, the results of this study show that it is possible to screen for dementia using unstructured free conversation without requiring a specific task. A screening method based on free conversation is simple and easy to disseminate because it can be performed without any specific tools. Furthermore, telemedicine is now widely used in many countries, especially since the COVID-19 outbreak29, but cognitive function tests can be difficult to perform via telemedicine; instead, free-conversation screening might be useful as a remote screening tool. In addition, since free conversation has no correct answer, unlike task-based testing, learning effects might be less likely to occur.
The present study had the following limitations. MCI and CHC samples were combined into a non-dementia class, and a binary classification versus dementia was performed. We plan to further investigate whether it is possible to discriminate between CHC and MCI and to perform a ternary classification as we accumulate more data. Moreover, as with the MMSE and other screening methods, the education level and age of the subject may affect the results of this method, and this possibility will require further investigation. In addition, the interview used in this study was designed to encourage speech by asking simple questions such as "How are you feeling today?", but since communication involves interaction, the result may be affected by the ability and attitude of the interviewer.
Although the number of datasets must be increased before application in actual clinical practice, the results of this study show that it is possible to construct a machine learning algorithm that can discriminate dementia from non-dementia with a high accuracy based on a free conversation without the use of a task. In the future, this methodology could be useful as a screening tool for dementia.

Methods
Data source. Data from the Project for Objective Measures using Computational Psychiatry Technology (PROMPT)30 were used in this study. PROMPT was a prospective observational multicenter study performed in Japan with the goal of identifying objective markers for mood disorders and dementia using voice and speech, body motion, facial expression, and daily activity data. Participants were recruited at the psychiatry departments of 10 different medical facilities, and each ethics committee, including that of the Keio University School of Medicine, approved the study. All the participants provided written informed consent before participating in this study, which was designed in accordance with the ethical principles of the Declaration of Helsinki. The recruitment period was from March 9, 2016, to March 31, 2019. This study used data from patients diagnosed as having major neurocognitive disorder or mild neurocognitive disorder according to the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) criteria and from participants recruited as cognitively healthy controls (CHC). CHC were screened for a history of mental disorders using the Mini-International Neuropsychiatric Interview (M.I.N.I.), and any CHC with a history of psychiatric disorder were excluded. Individuals with apparent speech problems, including aphasia and dysarthria, were also excluded.
The participants were given 10 min to have an unstructured conversation with a psychiatrist or psychologist, for example an interview on the topic of their mood or daily living. During that time, their speech was recorded with a microphone. After the interview, the Clinical Dementia Rating (CDR), Mini-Mental State Examination (MMSE), Logical Memory subtest of the Wechsler Memory Scale-Revised, and Geriatric Depression Scale (GDS) were administered. If the participant agreed, a similar interview was conducted and the above data were collected once again after a minimum interval of 4 weeks.

Data eligibility.
In the present study, we analyzed the data described above. To eliminate the effect of depressive symptoms on cognitive function, data were excluded from the analysis if a participant's GDS score was 10 or higher. In addition, data from subjects under the age of 45 years, data with missing conversational or rating data, and data from subjects who spoke with a strong dialect were excluded from the analysis.
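As a sketch, the exclusion criteria above can be expressed as a simple filter; the field and function names here are hypothetical stand-ins, not the study's actual data dictionary:

```python
# Illustrative sketch of the exclusion criteria; field names are invented.
def is_eligible(record):
    return (
        record["age"] >= 45                   # exclude participants under 45
        and record["gds"] < 10                # exclude GDS score of 10 or higher
        and record["transcript"] is not None  # exclude missing conversational data
        and record["ratings_complete"]        # exclude missing rating data
        and not record["strong_dialect"]      # exclude strong-dialect speakers
    )

def apply_eligibility(records):
    return [r for r in records if is_eligible(r)]
```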
Data labeling. In this study, we aimed to develop a system capable of screening for dementia. Therefore, we attempted to construct a machine learning model to discriminate between dementia and non-dementia, the latter including CHC and MCI. Dementia and non-dementia were determined by three neuropsychological tests, namely the CDR, the MMSE, and Logical Memory II. The cutoff for the Logical Memory II test was based on education history: 2 points or less for subjects with 0-9 years of education, 4 points or less for subjects with 10-15 years of education, and 8 points or less for subjects with 16 or more years of education. To improve the accuracy of the machine learning, we decided to use data that reflect typical symptoms for training. Therefore, data that fell into the following categories were used not only as test data, but also as training data: dementia with CDR ≥ 1, MMSE ≤ 23, and Logical Memory II below the cutoff; non-dementia (MCI) with CDR = 0.5, MMSE ≥ 24, and Logical Memory II below the cutoff; and non-dementia (CHC) with CDR = 0, MMSE ≥ 24, and Logical Memory II above the cutoff. Data that did not meet these criteria were used only as test data.
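A minimal sketch of these labeling rules, with the education-adjusted Logical Memory II cutoffs from the text (function names are illustrative):

```python
def lm2_below_cutoff(lm2_score, years_education):
    # Education-adjusted cutoffs for Logical Memory II:
    # 0-9 years -> <= 2, 10-15 years -> <= 4, >= 16 years -> <= 8.
    if years_education <= 9:
        return lm2_score <= 2
    elif years_education <= 15:
        return lm2_score <= 4
    else:
        return lm2_score <= 8

def label(cdr, mmse, lm2, edu):
    # Returns the training-eligible label, or None if the data point
    # meets none of the strict criteria (then it is used only as test data).
    below = lm2_below_cutoff(lm2, edu)
    if cdr >= 1 and mmse <= 23 and below:
        return "dementia"
    if cdr == 0.5 and mmse >= 24 and below:
        return "non-dementia (MCI)"
    if cdr == 0 and mmse >= 24 and not below:
        return "non-dementia (CHC)"
    return None
```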
In this study, data were acquired multiple times from the same participant, so the same participant could be in different states depending on when the conversational data was acquired (e.g., after conversion from MCI to dementia). Therefore, each dataset was labeled using the result of the cognitive evaluation performed at the time of the recording of the conversational data.

Document embedding. From the recorded data, only the subject's speech was transcribed into text data, including fillers, and compiled into a single document. This document was transformed into a vector of 150-dimensional features using previously reported technology31. In the present study, we set the negative sampling value to 5 and the number of dimensions to 150, and we finally obtained a 150-dimensional document vector from the morpheme elements. In addition, the same method was used to create a 50-dimensional vector using bi-grams of parts of speech as input features, for a total of 200 dimensions for morphemes and parts of speech.
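The resulting feature construction, concatenating the morpheme document vector with the part-of-speech bi-gram vector, can be sketched as follows; random values stand in for the outputs of the actual embedding method:

```python
import random

# Sketch: a 150-dimensional morpheme document vector concatenated with a
# 50-dimensional POS bi-gram vector gives the 200-dimensional feature
# vector fed to the classifier. Random values replace the real embeddings.
morpheme_vec = [random.random() for _ in range(150)]   # morpheme embedding
pos_bigram_vec = [random.random() for _ in range(50)]  # POS bi-gram embedding
features = morpheme_vec + pos_bigram_vec               # 200-dimensional input
```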
Machine learning procedure. In this study, we built a DNN-based prediction model that discriminated between the two classes of dementia and non-dementia. The DNN model was constructed using Python 3.6 and the TensorFlow 2.2.0 library as a five-layer neural network consisting of an input layer, three hidden layers, and an output layer. The hyperparameters were optimized using Optuna 2.0.0. Leave-one-out cross-validation (LOOCV) was used for model building and performance evaluation. Since multiple data acquisitions could be obtained from the same subject in this study, there was a risk that speech data from the same subject could be used in both the validation and training sets, which would inflate the apparent accuracy. To avoid this effect, we added a process to exclude text data from subjects who had provided validation data from being used as training data. The details are as follows; the architecture of the machine learning and validation methods is depicted in Fig. 4.

(i) Extract one test data point from all the data.
(ii) From the remaining data, exclude data from the same subject as the test data and data that do not meet the criteria for training.
(iii) Randomly split the remaining data so that the ratio of dementia to non-dementia is kept constant and the ratio of training data to validation data is 3:1. To account for the effect of random splitting, create 10 sets of training and validation datasets with different splits.
(iv) Build 10 prediction models with the 10 sets of training and validation datasets.
(v) Repeat steps (i) to (iv) for the number of samples.
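Steps (i) to (v) can be sketched schematically as below; `train_model` and `predict` are hypothetical stand-ins for the DNN training and inference, and the class-ratio-preserving split is simplified to a plain random split:

```python
import random

# Schematic sketch of the LOOCV-with-voting procedure. Each data item is
# a (subject_id, features, label) tuple; stratification is omitted here.
def loocv_with_voting(data, train_model, predict, n_splits=10):
    vote_counts = []
    for i, test in enumerate(data):                    # (i) one test item
        rest = [d for j, d in enumerate(data)
                if j != i and d[0] != test[0]]         # (ii) drop same subject
        votes = 0
        for _ in range(n_splits):                      # (iii) 10 random splits
            pool = rest[:]
            random.shuffle(pool)
            cut = len(pool) * 3 // 4                   # train:validation = 3:1
            model = train_model(pool[:cut], pool[cut:])  # (iv) build model
            votes += predict(model, test)              # 1 = "dementia" vote
        vote_counts.append(votes)                      # (v) repeat for all items
    return vote_counts
```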
The prediction accuracy of the voting results of the 10 prediction models was calculated for each test data point. The vote-count threshold that determines the prediction of the entire model was chosen to yield the highest accuracy. The accuracy, sensitivity, and specificity in this setting were then used as evaluation indices for the prediction models. To calculate the AUC, receiver operating characteristic (ROC) curves were created for the 10 models used for voting, and the average AUC was calculated.
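The threshold selection and the evaluation metrics can be sketched as follows; this is a toy illustration, not the study's evaluation code:

```python
def metrics_at_threshold(votes, labels, threshold):
    # votes: number of the 10 models voting "dementia" per test item;
    # labels: 1 = dementia, 0 = non-dementia.
    pred = [int(v >= threshold) for v in votes]
    tp = sum(p and y for p, y in zip(pred, labels))
    tn = sum((not p) and (not y) for p, y in zip(pred, labels))
    fp = sum(p and not y for p, y in zip(pred, labels))
    fn = sum((not p) and y for p, y in zip(pred, labels))
    accuracy = (tp + tn) / len(labels)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return accuracy, sensitivity, specificity

def best_threshold(votes, labels, n_models=10):
    # Choose the vote-count threshold that maximises accuracy.
    return max(range(1, n_models + 1),
               key=lambda t: metrics_at_threshold(votes, labels, t)[0])
```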
As a sub-analysis, we also evaluated the prediction accuracy when the data were divided into two groups according to sex and according to age (75 years and older vs. less than 75 years).

Relationship between the number of letters and prediction accuracy.
To examine the effect of utterance length on the prediction accuracy, we prepared text data of different document lengths, in units of 100 letters from the beginning of each text, and converted each of them into a 200-dimensional vector. Predictions were made on these vectors using a model built with LOOCV, and the relationship between document length and prediction accuracy was evaluated. To predict each vector, we used the model designed to predict the original document before the document length was changed.
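The preparation of length-controlled prefixes can be sketched as follows (the function name is illustrative):

```python
def prefixes_by_letters(text, step=100):
    # Prefixes of the transcript in 100-letter increments from the
    # beginning; the last prefix is the full text when its length is
    # not a multiple of the step.
    return [text[:n] for n in range(step, len(text) + step, step)]
```

Each prefix would then be vectorized into 200 dimensions and fed to the trained model, as described above.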

Verification of vectorization and machine learning algorithms.
To compare our document embedding and machine learning procedure with other methods, we calculated the prediction accuracy using TF-IDF and BERT for vectorization, and naive Bayes, logistic regression, a support vector machine, and XGBoost for machine learning. In the XGBoost model, 10 sets of training and validation patterns were created for each test data point extracted by LOOCV, and prediction by voting using the 10 models was performed. We also performed vectorization using TF-IDF and Japanese BERT and calculated the prediction accuracy of voting using the 10 DNN-trained models.
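As an illustration of the TF-IDF baseline, a minimal standard-library implementation over pre-tokenized documents (for Japanese, the tokens would be morphemes) might look like this; it is a sketch, not the study's actual configuration:

```python
import math
from collections import Counter

# Minimal TF-IDF sketch: docs is a list of token lists; the result is a
# list of {term: weight} dicts, one per document.
def tfidf(docs):
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    out = []
    for d in docs:
        tf = Counter(d)
        out.append({t: (c / len(d)) * math.log(n / df[t])
                    for t, c in tf.items()})
    return out
```

A term appearing in every document (here with the plain logarithmic IDF) receives zero weight, while rarer terms are up-weighted.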