Introduction

The ubiquity of smart mobile devices (e.g., smartphones) opens new opportunities for data collection in healthcare. Ecological Momentary Assessment (EMA)1 is a research method that aims to assess phenomena with a focus on ecological validity by allowing subjects and patients to report repeatedly, in real time, in real environments, and in different contexts, thus avoiding retrospective reporting bias2. The mHealth platform TrackYourTinnitus (TYT) uses EMA in combination with mobile crowdsensing (MCS) to track a user’s individual tinnitus and to monitor and assess its variability over time3,4. More specifically, baseline questionnaires are first used to collect users’ sociodemographic data. Then, a dynamic EMA questionnaire (EMA-D) is used to repeatedly assess the user’s tinnitus at randomly selected times of day along eight dimensions. These dimensions include the perception of the tinnitus (i.e., whether tinnitus is perceived at that moment; the first question of the EMA-D questionnaire) in addition to its loudness and distress, as well as mood, arousal, stress, concentration, and the presence of the user’s previously reported worst symptom4,5. The response to the first question (i.e., perceived presence) of the EMA-D questionnaire is highly clinically relevant. Therefore, it would be strongly desirable to be able to predict the presence of tinnitus based on the other information provided by the app users. Consequently, this paper investigates the first question of the dynamic questionnaire and whether its answer can be predicted using machine learning methods.

As further background on TrackYourTinnitus (TYT): the platform was launched in 2014 and follows the design of an open observational study in which users can freely enroll. TYT is not and has never been advertised; instead, we rely on word of mouth and on users discovering the apps in the app stores. After downloading the app, the general procedure consists of two parts. First, three baseline questionnaires concerning demographics and tinnitus history have to be filled in. Only after completing these questionnaires can users access the dynamic EMA questionnaires (EMA-D). Here, users can choose to have up to 14 questionnaires delivered randomly per day, or they can set the times themselves (in both cases, a notification is sent). In addition, a questionnaire can always be self-initiated. The aim is to record the current tinnitus situation as accurately as possible at as many unpredictable points in time as possible. This principle has been working successfully since 2014, and many more of our apps have followed it.

Furthermore, in several studies, EMA has proven to be a highly versatile method for data collection. By collecting longitudinal data with high ecological validity, EMA is often used to gain insight into psychological and behavioral phenomena6,7,8. With regard to data assessment, a pilot study showed that EMA is an applicable and accepted assessment method for tinnitus data9. The latter is supported by the results of another study showing that emotional states influence the process of how tinnitus loudness leads to tinnitus distress10.

When using smart mobile devices, EMA is often combined with MCS to collect contextual data3,4,5,11. In an exploratory study, opportunistic MCS was used to collect smartphone usage logs as a proxy measure for mood monitoring12. The results show that highly accurate predictions based on EMA in combination with MCS are possible.

Because of the quantity of data points, longitudinal data are ideally suited for modern data analysis techniques that require large data sets. For example, longitudinal EMA data were used in one study to detect warning signs of suicidal ideation using functional linear models13. By collecting response times of EMA questions, another paper introduced a method for predicting short-term mood changes14. Similarly, EMA metadata sourced from the TYT platform was used in a study that predicted users’ operating system with up to 80% accuracy15. While there are some machine learning approaches in the context of tinnitus, to our knowledge there is no clear a priori model for predicting tinnitus using EMA data. Therefore, we take a data-centric approach and test different classification models (which take different approaches to classification) to determine the model that best fits this context. In addition, we also track small gains in predictive power: in the context of tinnitus, little is known about the important factors, and due to the heterogeneity of the disease pattern, even small differences may play a role.

As mentioned above, the responses to the EMA-D questionnaire will be examined. In particular, we answer the question to what extent the questions of the EMA-D questionnaire that do not assess the tinnitus experience per se (i.e., excluding the loudness and distress of the tinnitus) can be used to predict the answer to the first question, namely whether someone is aware of the tinnitus at a particular point in time. In addition, for the first time, the study takes into account a circumstance that has been ignored in most previous TYT analyses (e.g.,16,17): one problem with open observational studies using MCS and EMA, such as TYT, is that the frequency of responses from participating users varies widely18,19. There are many reasons for this. Some could be elicited in interviews, such as a user feeling out of treatment options and grasping at the proverbial last straw. Other reasons can only be conjectured. However, the way an app is designed and what feedback is given to the user may play a role. Nevertheless, the paradigms of EMA and MCS applied to medical questions are still very young, which is also reflected in the fact that there are still few results supporting the evidence. In the context of evidence, for example, it is noticeable that the WHO mERA checklist20 is very rarely cited or discussed. Given the variation in response behavior (i.e., the frequency of responses among participating users), the prediction of the first question should consider different user groups. This addresses the question of whether the planned prediction should specifically consider different user groups and whether there are indicators to take this into account more strongly in clinical practice in the future.

To distinguish user groups, the TYT data was closely examined and threshold values were identified to divide the users into groups. First, only users who completed more than 10 EMA-D questionnaires were included (threshold 1). Then, an upper limit of 400 completed EMA-D questionnaires was identified (threshold 2), which distinguishes between normal users (11–399 completed EMA-D questionnaires; referred to as Normal Users) and power users (400 or more completed EMA-D questionnaires; referred to as Power Users). The number of EMA-D questionnaires per user is depicted in Fig. 2. A third threshold was identified based on whether users actually report both tinnitus and non-tinnitus in the dynamic questionnaire. Here, we arrived at a further distinction between users for whom both values were recorded at least once (labeled Non-permanent Tinnitus Users) and users who reported presence at least once and absence at least three times (labeled Rather absent Tinnitus Users). This resulted in a stratification into five user groups. To our knowledge, such a subdivision has not been done in other mHealth work related to tinnitus, nor in other mHealth work in general.

Regarding the prediction of whether someone perceives tinnitus or not, another disparity in the data set is how often someone reports one or the other status. As a result, our data sets were very unequal not only in terms of the frequency of completed questionnaires, but also in terms of a user’s classes (class 0: does not perceive tinnitus; class 1: perceives tinnitus). To account for this, two approaches were taken to balance the data set. On the one hand, where possible, we automatically adjusted the class weights of the machine learning procedures inversely proportional to the class frequencies in the input data (we call this balanced). On the other hand, entries in the majority class were eliminated until it matched the size of the minority class (we call this downsampling). These variants are also presented in the context of the results. Note that we did not add entries to the minority class (i.e., upsampling using approaches such as SMOTE21), as this would artificially generate responses to the EMA questionnaire that do not otherwise exist.
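As a minimal sketch (with hypothetical names, not taken from the TYT code base), the two strategies can be expressed in Python as follows:

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Toy stand-in for the real EMA-D answers; "q1" is the binary target
# (0: tinnitus not perceived, 1: tinnitus perceived).
df = pd.DataFrame({"q1": [1, 1, 1, 1, 0, 0],
                   "stress": [3, 5, 4, 2, 1, 2]})

# "balanced": class weights inversely proportional to class frequencies,
# passed to classifiers that expose a class_weight parameter.
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=df["q1"])

# "downsampling": randomly drop majority-class rows until both classes
# have the size of the minority class.
n_min = df["q1"].value_counts().min()
df_down = (df.groupby("q1", group_keys=False)
             .sample(n=n_min, random_state=42))
```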

Based on the above findings and work, in this paper we investigate the following research questions using machine learning methods. It should be noted that for each of the predictions, we only considered independent variables (i.e., variables that are not trivially indicative of the presence of tinnitus) to predict the presence of tinnitus for a given EMA questionnaire:

RQ1::

To what extent can we predict the presence of tinnitus for a given EMA-D answer sheet?

RQ2::

To what extent does the number of completed EMA-D questionnaires per user influence the prediction?

RQ3::

To what extent does the type of tinnitus (non-permanent and rather absent) influence the prediction?

Results

The results of the predictions are presented below. As mentioned earlier, the predictions were performed for five user groups, so the results are presented along these groups. Likewise, two strategies were followed to balance the dataset, and this distinction is also reflected in the results.

The eight machine learning approaches used provided different results, both in terms of the different research questions and the evaluation metrics. In summary, the results suggest that there is some relationship between the analyzed data and the presence or absence of tinnitus, which may help to better understand the causes of tinnitus.

Results balanced

The results for the case where all users are included in the classification with the balanced approach are shown in Table 1. It is noticeable that the different classifiers are not very far apart, even though the MLP classifier performs best. A predictive accuracy of about 70% can be achieved for all users in the balanced case. The largest area under the curve is achieved by LRC.

Table 1 Results for all users, including standard deviation.

The results for the Power Users in the balanced case are shown in Table 2. It is noticeable that the results vary more between the classifiers than for all users, but also that higher predictive accuracy can be achieved (although even higher results were expected). The largest area under the curve is provided by the SVM.

Table 2 Results for Power Users, including standard deviation.

In the case of Normal Users with the balanced approach (see Table 3), the spread between the classifiers’ results is again smaller, and it is also noticeable that the best prediction here is worse than in the previous two cases of Power Users and all users. Interestingly, XGB produces the best result for the first time. LRC has the largest area under the curve.

Table 3 Results for Normal Users, including standard deviation.

For the Non-permanent Tinnitus Users with the balanced approach (see Table 4), no new findings emerge, except that the prediction rates are generally somewhat lower.

Table 4 Results for Non-permanent Tinnitus Users, including standard deviation.

For the Rather absent Tinnitus Users with the balanced approach (see Table 5), the results are even lower than for Non-permanent Tinnitus Users. This indicates above all that the tinnitus and non-tinnitus classes are too poorly distributed for question 1 of the EMA-D questionnaire. Furthermore, since we analyze specific subsets of users, we would have expected higher results compared to Non-permanent Tinnitus Users.

Table 5 Results for Rather absent Tinnitus Users, including standard deviation.

Across all datasets, the multi-layer perceptron classifier (i.e., neural network) is the approach with the best accuracy. The 10-fold cross-validated AUC of the MLP classifier for all users is depicted in Fig. 1. However, other approaches (e.g., KNC, XGB, or RFC) also provide promising results. Illustrations of all other approaches can be found in the supplementary material.

Figure 1: 10-fold ROC curve and AUC of the MLP classifier.

Results downsampling

In the case of downsampling for the five user groups, it is noticeable that the prediction rates decrease and that, in some cases, different classifiers perform better than in the balanced case. This suggests that the balancing strategy can either be improved or is even counterproductive here. However, it could also indicate that very meaningful values were eliminated during downsampling.

When all users are considered (see Table 6), the results do not vary much between classifiers, but, for the first time, the RFC classifier is stronger than in the previous scenarios.

Table 6 Results for all users (downsampled), including standard deviation.

For Power Users, the picture is similar (see Table 7). However, compared to the balanced approach, the general jump between all users and Power Users is larger this time.

Table 7 Results for Power Users (downsampled), including standard deviation.

No further insights are gained from the Normal Users (see Table 8), except that, as with the balanced approach, the same classifier beats all others in all metrics except sensitivity.

Table 8 Results for Normal Users (downsampled), including standard deviation.

There are no further findings compared with the previous results for Non-permanent Tinnitus Users, as shown in Table 9.

Table 9 Results for Non-permanent Tinnitus Users (downsampled), including standard deviation.

With Rather absent Tinnitus Users and downsampling (see Table 10), the results are closer to Non-permanent Tinnitus Users than in the balanced case.

Table 10 Results for Rather absent Tinnitus Users (downsampled), including standard deviation.

The basic prediction rate varies between 60 and 70%. At a more subtle level, we find differences that are rather unusual but can also arise depending on the strategy, so they may not have any medical significance. Also, one would have expected the prediction to perform disproportionately better for Power Users than for the other groups. Moreover, it is interesting that the algorithms sometimes produce strikingly different predictions between the balanced and downsampling variants.

Discussion

At the beginning of the work, we hypothesized, for each of the five user groups, how well the presence of a tinnitus perception (asked by question 1 in the EMA-D questionnaire) can be predicted from the other questions of the EMA-D questionnaire. For this purpose, we excluded those questions from the EMA-D questionnaire that showed a medical correlation (dependent variables, i.e., questions 2, 3, and 8). Since the same objective applies to all five user groups and the results do not differ significantly, we discuss the results in their entirety with regard to the three research questions raised. Basically, it is noticeable that despite the different scenarios, a prediction in the range of 50–70% is always possible, with the prediction tending towards 60–70%. Since question 1 of the EMA-D questionnaire has a binary scale (cf. Table 11; 0: no tinnitus perceived, 1: tinnitus perceived), the chance-level baseline in this case is 50%. Comparing the results with this baseline, a significant improvement is obtained, indicating that the dynamic questions of the EMA questionnaire not directly related to question 1 carry predictive power, again for all user groups. However, the TYT dataset is not a small mHealth dataset, so one might have expected a predictive value higher than the observed fluctuation around 70%. Therefore, the result should be considered indicative only, even though the baseline is exceeded.

However, it is noticeable that the Power Users achieve only a small improvement in the prediction, especially in the downsampling scenario with 77.7%. Comparing this with the best result for all users (66.8%), the difference is not that large. On the one hand, this is unfortunate, because one could have expected a better prediction here, since these users use the system more and probably give honest answers over time. On the other hand, the result is very good, at least with respect to the research questions of this paper, namely that the totality of all measurements already gives a representative picture. Another encouraging observation is that the best overall result was obtained in the case of downsampling and Power Users. Again, no data was imputed in the case of downsampling, so evidently the raw data from users who frequently use the platform is more meaningful at the assessment level, which supports the basic tendency of the positive prediction of question 1 of the EMA-D questionnaire.

Across all scenarios and data sets, the classifiers score above the baseline regarding sensitivity. In other words, they are able to predict cases of tinnitus. Regarding specificity, some classifiers (i.e., DT, KNC, MLP, RFC, and XGB) yield results below the baseline in the balanced scenario, indicating that the prediction of cases in which no tinnitus was reported might not be ideal. This is, however, not the case for the downsampling scenario, which may indicate that the downsampling approach should be preferred when the prediction of no-tinnitus cases is relevant.

It can also be seen that the two strategies, balanced and downsampling, produce little difference in overall prediction rates, although which classifier performs best differs considerably between them. For example, in the case of balanced, the MLP classifier performs best on average, while in the case of downsampling, the RFC and XGB classifiers perform better. Since both RFC and XGB are tree methods, it is possible that downsampling eliminates, or balancing reweights, values that influence the respective tree method. This shows nicely how the individual methods react to data changes. Since the basic tendency of the predictions remains the same, it can be assumed that the different performance of the classifiers is due to the balanced and downsampling strategies and has nothing to do with medical significance. Precisely for these reasons, we consider the work a further contribution in the field of mHealth / machine learning / tinnitus as well as mHealth / machine learning in general. When classifying at the assessment level, ML methods seem to perform similarly on tabular mHealth data, independent of how the dataset is processed in terms of balance, as shown in this work. One could also speak of a comprehensive test of the mentioned strategies at the assessment level. Since the dependent questions in the EMA-D questionnaire were excluded, it can be assumed that, with more data, tinnitus perception can be predicted even better on the basis of the independent variables used in our EMA-D questionnaire: mood, arousal, stress, and concentration. Therefore, a first indication is given that the independent variables are able to predict momentary tinnitus at the assessment level.

In terms of clinical relevance, the outcome of this paper enables two main contributions. First, we want to better understand how smartphone apps need to be designed to best support tinnitus patients in particular, and patients in general, including in daily data collection. Only in this way will we be able to collect data in the future that gives us further insight into tinnitus. Identifying subgroups is highly relevant clinically: since there is no general treatment method, the hope is to develop specific treatments for subgroups, and findings like this take us further in addressing them. For example, as an outcome, we need to identify early on which users are power users and which are normal users in order to incentivize them accordingly. Second, the realization has matured that, from a data-driven perspective, the various alternatives (classifiers, data preparation for the imbalances) do not have as great an impact across subgroups as might have been suspected. Nevertheless, we will need more power users, because they promise deeper results in the future. Thus, it is imperative to find motivational mechanisms that do not have a detrimental side effect (i.e., that data are collected only because of the motivational mechanism).

The work has some limitations, which must be considered carefully. First, the prediction is performed at the assessment level. Other work has shown what a difference this can make with respect to the user level17,22. We are currently comparing the user and assessment levels with the various options in a larger study. A further bias may arise from the user interface elements of the independent questions: sliders bear the risk of the anchor effect23 and can thus distort a result15; since the independent questions are all slider questions, this has to be taken into account. Furthermore, a selection bias can be assumed, even though several characteristics were compared in Table 12. Since TYT is an open observational study, i.e., we do not hand out a strict study protocol, the participating users can largely decide for themselves how the app is used (the number, time of day, and pattern of assessments (random during the day / always at the same time) can be defined individually), so a selection bias is to be expected24. Additionally, we do not obtain information about the number of prompts a user is exposed to. Consequently, some individuals may respond to all prompts they are exposed to, whereas others may respond to only a few of the EMA prompts. Therefore, we cannot examine how the number of prompts completed relates to the number of prompts to which users were exposed. However, given the number of EMA responses, it can be assumed that power users respond more frequently to prompts than normal users. Since the Android and iOS user interfaces never look exactly the same and TYT is available on both platforms, an information bias must also be assumed. In principle, the two strategies balanced and downsampling can also be seen as a source of bias, but since the results do not vary greatly, this bias can be considered small. However, the strongly skewed distribution of the users’ answers across the two classes (tinnitus perceived or not) must itself be seen as a bias: we do not know whether the classifiers saw enough perceived and non-perceived tinnitus states per individual to distinguish them. Another limitation arises from the different user groups. In this case, neither tests for dependent nor for independent samples are fully correct, because there are partial dependencies and overlaps between the user groups. However, to ensure comparability between the user groups, we conducted tests for independent samples.

We view the analyses performed as another indicator that machine learning on mHealth data presents many challenges that must be approached with caution. From the data collection side (information bias, e.g., sliders or Android vs. iOS) to the nature of the mHealth study (selection bias, e.g., for what reasons a user answers), there is much to consider that is not related to the actual machine learning. There is also the issue of assessment versus user level, and how to validate in order to obtain more robust results. Against this backdrop, it is interesting to note for TYT that, at the assessment level, the different strategies followed do not have as much impact as one might have suspected. It can be concluded that the as-is data with good cross-validation leads to meaningful results, at least in the sense of what the data can basically provide. The next point that should definitely be explored is how the user and assessment levels differ.

Materials and methods

Overview

Data source

The TYT mHealth platform has been in operation since 2014 and has been continuously evolved since then. The platform consists of a registration and information website (https://www.trackyourtinnitus.org/), a native mobile application available for both iOS and Android, and a central backend that stores the collected data in a relational database. The mobile applications (iOS, Android) track users’ individual tinnitus and tinnitus-related variables by asking them to complete EMA assessment questionnaires at randomly selected times of day (so-called EMA-D questionnaires). The structure of the EMA questionnaire is shown in Table 11. The exact procedure of the TYT application has been described in previous work4,5.

Tinnitus is the perception of an internal sound in the ears with no corresponding external sound. Symptoms have been found to be subjective and to vary over time. Therefore, TYT was developed to assess this individual variability of symptoms based on EMA and MCS3. TYT uses self-report questionnaires as a data collection tool to collect data on the user’s individual tinnitus. The collected responses, in turn, are stored in the central relational database along with a set of metadata (e.g., timestamp and user agent). Details of the TYT dataset as well as the underlying database structure have also been described in previous work24.

Table 11 Questions of the EMA-D questionnaire in the TYT smartphone application, along with their scale and the dimension that is measured3,5.

User statistics

To ensure comparability between user groups, we examined the important clinical baseline characteristics of age distribution, handedness, and family history of tinnitus. For age distribution, there were no statistically significant differences between group means, as revealed by a one-way ANOVA (\(F(4, 1621) = 1.81, p=0.12\)). For handedness, a \(\chi ^2\)-independence test showed that there was no significant relationship between user groups, \(\chi ^2(8, N=1654) = 0.91, p=1.0\). We obtained the same result for tinnitus family history, \(\chi ^2(4, N=1650) = 0.16, p=1.0\). The different degrees of freedom resulted from missing values for some users. A summary of baseline characteristics for the five user groups is provided in Table 12.
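For illustration, these comparisons can be reproduced with SciPy along the following lines; the toy arrays merely stand in for the real per-group ages and the group-by-category contingency tables.

```python
from scipy import stats

# Toy stand-in for the ages of users in each of the five user groups.
ages_by_group = [[42, 55, 61], [48, 52], [39, 44, 50], [47, 58], [51, 60]]
f_stat, p_age = stats.f_oneway(*ages_by_group)  # one-way ANOVA on age

# Toy 5 x 3 contingency table (user group x handedness category).
handedness = [[80, 15, 5], [70, 20, 10], [75, 18, 7],
              [82, 12, 6], [78, 16, 6]]
chi2, p, dof, expected = stats.chi2_contingency(handedness)
```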

Table 12 Statistical comparison of the five user groups. \(\chi ^2\) tests for handedness (\(\chi ^2(8, N=1654) = 0.91, p=1.0\)) and family history of tinnitus complaints (\(\chi ^2(4, N=1650) = 0.16, p=1.0\)) suggest that there are no significant differences between the groups. The same result appears for the age distributions.

Data preparation

To answer research questions 1–3 and to make the machine learning analysis comprehensible, Fig. 3 illustrates the steps taken during our data preparation. For the machine learning analysis, we considered five datasets addressing the research questions. For research question 1, we analyzed 45,935 answers to the harmonized EMA-D questionnaire from all users (DS1). To further validate the results of the machine learning analysis, to obtain different insights into the presence or absence of tinnitus, and to answer research questions 2 and 3, we considered four additional subsets. Power Users (DS2) are users with 400 or more EMA questionnaire answers, i.e., the users in the top 5% quantile regarding the number of completed EMA questionnaires. They use the platform extensively, and we assume that their subset therefore contains insightful knowledge; they form the first sample to answer RQ2. Normal Users (DS3) have completed fewer than 400 EMA questionnaires. Consequently, they correspond to the other 95% of users, forming the second sample for RQ2. Since they are the opposite of Power Users, we hope to find new correlations in this subset that are unrelated to the number of questionnaires answered. Since some users reported either always or never having tinnitus (i.e., all of their EMA questionnaires contain only 1 or only 0 for question 1), we considered this scenario with the remaining two subsets. Users with non-permanent tinnitus (DS4) reported both the presence and absence of tinnitus at least once each, whereas users with rather absent tinnitus (DS5) reported the absence of tinnitus at least three times and the presence at least once across all questionnaires they answered. These two datasets refer to RQ3.

This leads us to the following five sub-datasets (DS) of our dataset (a code sketch of this stratification follows the list):

DS1::

All Users

This dataset contains all answered EMA questionnaires of users with at least 11 filled-out questionnaires and no missing values. It contains 45,935 filled-out questionnaires from 518 different users.

DS2::

Power Users

This data set contains all answered EMA questionnaires of users with 400 or more filled-out questionnaires. It contains 14,743 questionnaire answers from 22 different users.

DS3::

Normal Users

This data set consists of 31,192 answered EMA questionnaires from 496 users with fewer than 400 filled-out questionnaires.

DS4::

Non-permanent Tinnitus Users

This data set represents all users that reported both presence and absence of tinnitus at least once. This leads to 34,998 questionnaire answers from 361 different users.

DS5::

Rather absent Tinnitus Users

As a subset of DS4, this data set corresponds to all users that reported the absence of tinnitus at least 3 times and the presence at least once. 28,964 questionnaires from 280 users are part of this data set.
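As a rough illustration of how these subsets can be derived, the following pandas sketch reconstructs the five groups from per-user answer counts; the DataFrame `ema` and its column names are hypothetical stand-ins for the actual TYT database tables.

```python
import pandas as pd

# Toy stand-in for the table of answered EMA-D questionnaires; in reality
# `ema` would be loaded from the TYT relational database.
ema = pd.DataFrame({"user_id": [1, 1, 1, 2, 2, 3],
                    "q1":      [1, 0, 0, 1, 1, 0]})

counts = ema.groupby("user_id").size()
ds1 = counts[counts > 10].index                     # DS1: all users (> 10 answers)
ds2 = counts[counts >= 400].index                   # DS2: Power Users
ds3 = counts[(counts > 10) & (counts < 400)].index  # DS3: Normal Users

q1 = ema[ema["user_id"].isin(ds1)].groupby("user_id")["q1"]
present = q1.sum()              # per-user count of "tinnitus perceived"
absent = q1.count() - present   # per-user count of "not perceived"

ds4 = present[(present >= 1) & (absent >= 1)].index  # DS4: Non-permanent
ds5 = present[(present >= 1) & (absent >= 3)].index  # DS5: Rather absent
```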

Table 13 Additional possible thresholds for Power Users.

Of course, there are also other possible thresholds than 400 questionnaires per user to distinguish between Power Users and Normal Users (see RQ2). Alternatives can be either more restrictive, such as 600 questionnaires per user, or less restrictive, such as 200 questionnaires. However, the setting of this threshold affects the size of the underlying data set both in terms of the number of users and of questionnaires (see Table 13). On the one hand, setting the threshold to 200 questionnaires per Power User results in 51 users (9.85%) and 22,745 questionnaires (49.52%). Using the top \(\sim\) 10% of users seems to be a solid estimate; however, it would also draw in \(\sim\) 50% of questionnaires. On the other hand, setting the threshold to 600 questionnaires per Power User would result in 9 users (1.74%) and 8,406 questionnaires (18.30%). In this scenario, the proportion of users would be too low, and considering only 18.30% of questionnaires seems too restrictive. Therefore, we set the threshold to 400, resulting in 4.25% of users and 32.1% of questionnaires. The threshold of 400 is close to the 95th percentile and also includes \(\sim\) 30% of all questionnaires. Consequently, 400 is a good middle ground between 200 and 600. In addition, Fig. 2 shows the number of questionnaires per user. A separation at 400 questionnaires can be seen, which also supports the decision.
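The trade-off between candidate thresholds can be inspected with a few lines, reusing the per-user answer counts `counts` from the sketch above; the printed shares would correspond to the user and questionnaire proportions reported in Table 13.

```python
# Reuses the per-user answer counts `counts` from the previous sketch.
for threshold in (200, 400, 600):
    power = counts[counts >= threshold]
    print(f"{threshold}: {len(power) / len(counts):.2%} of users, "
          f"{power.sum() / counts.sum():.2%} of questionnaires")
```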

Figure 2: Number of EMA-D questionnaires per unique user. Users with fewer than 11 submitted questionnaires are omitted. Normal Users submitted 11–399 questionnaires and Power Users 400 or more.

Figure 3: Data preparation process.

Figure 4: Downsampling process.

Machine learning analysis

We applied eight different machine learning approaches with the goal of predicting the presence or absence of tinnitus for a given assessment in the EMA-D data. In our dataset, this corresponds to the answer to question 1. The following eight approaches were applied to the different datasets: a Decision Tree (DT), a Random Forest Classifier (RFC), a Support Vector Machine (SVM), a Complement Naive Bayes classifier (CNB), a K-nearest Neighbors Classifier (KNC), a Logistic Regression Classifier (LRC), a Multi-layer Perceptron Classifier (MLP), and an Extreme Gradient Boosting Classifier (XGB). These approaches were selected because they represent a variety of different classification paradigms. It should be noted that all machine learning approaches were applied at the assessment level of the EMA-D questionnaires. In other words, a user’s assessments may be included in both the training and validation datasets, which could introduce bias. An alternative approach would be to assign each user to either the training or the validation dataset. However, if users are only part of the training or the validation dataset, it is important to ensure that there are no user characteristics between the two groups that could lead to bias. The TYT application follows an EMA-driven approach in which random, voluntary, and dynamic assessments are the main goal. As a result, it is difficult to identify sufficiently large groups of users with similar assessment characteristics. Other possibilities in this context are stratified cross-validation25, external validation26, or LOSO27. In further experiments, these methods will be applied to TYT data and compared with other mHealth data sources. Beyond these considerations, the dataset was prepared as follows before training the classifiers: in the case of downsampling, entries were randomly eliminated from the majority class (i.e., entries that report tinnitus) until there was the same number of entries as in the minority class (i.e., entries that do not report tinnitus). The concrete steps of the downsampling approach are depicted in Fig. 4. The other approach, which we call balanced, adapts the class weights inversely proportional to the class frequencies in the input data, where possible. This was the case for DT, RFC, SVM, XGB, and LRC. The remaining approaches do not allow for weight adjustments and were therefore used with default settings.
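As an illustration, the eight classifiers could be instantiated in scikit-learn and xgboost roughly as follows. This is a sketch under the assumption of library defaults: the exact hyperparameters used in the study are not specified here, and for XGB the analogous scale_pos_weight parameter stands in for class_weight.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import ComplementNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

classifiers = {
    "DT":  DecisionTreeClassifier(class_weight="balanced"),
    "RFC": RandomForestClassifier(class_weight="balanced"),
    "SVM": SVC(class_weight="balanced", probability=True),
    "CNB": ComplementNB(),          # no class_weight support: defaults
    "KNC": KNeighborsClassifier(),  # no class_weight support: defaults
    "LRC": LogisticRegression(class_weight="balanced"),
    "MLP": MLPClassifier(),         # no class_weight support: defaults
    "XGB": XGBClassifier(scale_pos_weight=1.0),  # set to n_negative / n_positive
}
```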

Finally, it should be mentioned once again that the classifications were carried out separately for each of the five user groups described above. The groups relativize the assessment-level bias, but, to remain comparable, all five groups were evaluated at the assessment level.

Furthermore, for the KNC, we calculated the mean error for all k values between 1 and 500 and selected the k value with the lowest mean error for each data set. In addition, there is an intuitive relationship between question 1, question 2, and question 3 (see Table 11). Question 8 is dynamic in the sense that the question may vary from user to user depending on the worst symptom collected in a previous questionnaire. Therefore, including questions 2, 3, and 8 in models predicting question 1 would introduce bias. To avoid this bias, we excluded questions 2, 3, and 8 from the machine learning analysis. In short, we predict the answer to question 1 based solely on the answers to questions 4, 5, 6, and 7.
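The k selection could look roughly like the following sketch. How exactly the mean error was computed in the study is not specified, so the cross-validated error below is an assumption, and the synthetic X and y merely stand in for the features (questions 4–7) and labels (question 1).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the four independent questions and question 1.
X, y = make_classification(n_samples=2000, n_features=4, n_informative=4,
                           n_redundant=0, random_state=0)

# Mean error (1 - mean accuracy) for each k in 1..500; keep the best k.
mean_errors = [
    1.0 - cross_val_score(KNeighborsClassifier(n_neighbors=k),
                          X, y, cv=10, scoring="accuracy").mean()
    for k in range(1, 501)
]
best_k = int(np.argmin(mean_errors)) + 1  # k values start at 1
```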

To further validate our results, each reported result is based on a 10-fold cross-validation. In this process, the entire data set was divided into 10 equal parts. Nine of the ten parts were used to train the model, while the remaining part was used for testing and optimization. This was repeated 10 times, with each repetition using a different part of the data set for testing and the remaining 9 parts for training. To address the imbalance in our data, we split the data using a stratified strategy. This ensures that both classes are correctly represented in both the training and testing sets. For classifiers that support this option, we also adjusted the weights of the two classes to account for the imbalance. By adjusting the weights to be inversely proportional to the class frequencies, we ensure that the classifiers do not ignore any class. The whole procedure was repeated 10 times, and the averages were calculated over all 10 runs. To corroborate our results, we applied the above procedure to all five data sets shown in Fig. 3 and to each of the eight machine learning approaches.
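In scikit-learn terms, this protocol corresponds roughly to a repeated stratified k-fold evaluation; the sketch below uses a synthetic dataset and a single classifier as placeholders for the TYT data and the eight approaches.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for one of the five TYT datasets.
X, y = make_classification(n_samples=2000, n_features=4, n_informative=4,
                           n_redundant=0, weights=[0.3, 0.7], random_state=0)

# Stratified 10-fold cross-validation, repeated 10 times; scores are
# averaged over all 100 train/test runs.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
scores = cross_val_score(LogisticRegression(class_weight="balanced"),
                         X, y, cv=cv, scoring="roc_auc")
print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```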

Six different metrics are used to evaluate the results: accuracy, the weighted F1 score, the area under the Receiver Operating Characteristic curve (AUC), precision, sensitivity, and specificity. All analyses were performed in the following environment: a laptop with an i7 core (2.60 GHz) running Python with scikit-learn.
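Since specificity has no ready-made scorer in scikit-learn, a sketch of how the six metrics can be computed from predictions (with placeholder labels, predictions, and probabilities) is:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Placeholder labels, hard predictions, and positive-class probabilities.
y_true = [0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1, 1]
y_prob = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.95]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "Accuracy": accuracy_score(y_true, y_pred),
    "F1 (weighted)": f1_score(y_true, y_pred, average="weighted"),
    "AUC": roc_auc_score(y_true, y_prob),
    "Precision": precision_score(y_true, y_pred),
    "Sensitivity": recall_score(y_true, y_pred),  # true positive rate
    "Specificity": tn / (tn + fp),                # true negative rate
}
print(metrics)
```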

Ethics approval

The study was approved by the Ethics Committee of the University Clinic of Regensburg (ethical approval No. 15-101-0204).

Informed consent

All users read and approved the informed consent before participating in the study.