Phenotyping chronic tinnitus patients using self-report questionnaire data: cluster analysis and visual comparison

Chronic tinnitus is a complex, multi-factorial symptom that requires careful assessment and management. Evidence-based therapeutic approaches involve audiological and psychological treatment components. However, not everyone benefits from treatment. The identification and characterisation of patient subgroups (or “phenotypes”) may provide clinically relevant information. Due to the large number of assessment tools, data-driven methods appear to be promising. The acceptance of these empirical results can be further strengthened by a comprehensive visualisation. In this study, we used cluster analysis to identify distinct tinnitus phenotypes based on self-report questionnaire data and implemented a visualisation tool to explore phenotype idiosyncrasies. 1228 patients with chronic tinnitus from the Charité Tinnitus Center in Berlin were included. At baseline, each participant completed 14 questionnaires measuring tinnitus distress, -loudness, frequency and location, depressivity, perceived stress, quality of life, physical and mental health, pain perception, somatic symptom expression and coping attitudes. Four distinct patient phenotypes emerged from clustering: avoidant group (56.8%), psychosomatic group (14.1%), somatic group (15.2%), and distress group (13.9%). Radial bar- and line charts allowed for visual inspection and juxtaposition of major phenotype characteristics. The phenotypes differed in terms of clinical information including psychological symptoms, quality of life, coping attitudes, stress, tinnitus-related distress and pain, as well as socio-demographics. Our findings suggest that identifiable patient subgroups and their visualisation may allow for stratified treatment strategies and research designs.


Materials and methods
Patients. Analyses were based on data from N = 1228 patients with chronic subjective tinnitus who had been treated at the Tinnitus Center of Charité Universitätsmedizin Berlin, Germany, between January 2011 and October 2015. All patients had been suffering from tinnitus for 3 months or longer and were 18 years of age or older. Exclusion criteria comprised the presence of acute psychotic illnesses or addictions, deafness and insufficient knowledge of the German language. Multimodal psychosomatic assessments were carried out by ENT, internal, psychosomatic and physical therapy specialists. Treatment comprised an intensive, multimodal 7-day program that included informational counselling, detailed ear-nose-throat (ENT) as well as medical and psychological diagnostics, cognitive-behavioural therapy interventions, auditory training, relaxation exercises, and physiotherapy. Patients who presented with objective tinnitus were excluded from the present therapy and treated medically, as applicable. Ethical approval was granted by the Charité Universitätsmedizin Berlin ethics committee (reference number EA1/115/15). All relevant guidelines and regulations were followed. All patients gave informed written consent for data collection. Prior to the analyses, all data had been anonymised.
Data from 2875 (70.1%) patients who did not complete all questionnaires were excluded due to missing values. Excluded patients were slightly, but significantly, older than those included in the final sample (excluded: µ = 51.73, σ = 13.63; included: µ = 50.00, σ = 11.91; t(2630.8) = −4.07, p < 0.01). Since all features of the SF8 and some features of the BSF, SWOP and PSQ are scored such that higher values indicate better quality of life, these positively worded features were reversed (new value = maximum feature value − old value) so that the interpretation (higher scores represent higher burden) remained consistent across features. Hereafter, feature names with a *-suffix denote reversed features. Because value ranges differed widely, each feature was standardised prior to cluster analysis via z-score normalisation to have a mean of 0 and a standard deviation of 1.
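The two preprocessing steps described above (reversal of positively worded features, then z-score normalisation) can be illustrated with a minimal sketch. The function names, the toy scores and the 0 to 4 scale are hypothetical. We also assume that "maximum feature value" refers to the theoretical scale maximum and that the population standard deviation is used; the text does not specify either choice.

```python
from statistics import mean, pstdev

def reverse_feature(values, max_value):
    """Reverse a positively worded feature: new value = maximum - old value."""
    return [max_value - v for v in values]

def z_score(values):
    """Standardise a feature to mean 0 and standard deviation 1."""
    mu = mean(values)
    sd = pstdev(values)  # assumption: population SD (sample SD is the alternative)
    if sd == 0:
        raise ValueError("a constant feature cannot be standardised")
    return [(v - mu) / sd for v in values]

# Hypothetical scores on a 0-4 scale where higher originally meant "better".
raw = [4, 3, 0, 1, 2]
reversed_scores = reverse_feature(raw, max_value=4)  # higher now means higher burden
standardised = z_score(reversed_scores)
```

After both steps, every feature is expressed in standard deviations from the overall patient mean, which is what makes features with different scales comparable in the clustering and in the charts.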

Identification of tinnitus phenotypes using clustering.
To identify a distinct set of tinnitus phenotypes, the clustering algorithm X-means was employed 32 . X-means is a parameter-free extension of the popular K-means clustering algorithm that incorporates the Bayesian Information Criterion 33 (BIC) to automatically find an appropriate number of clusters K by trading off goodness of fit against the number of clusters. Let $D$ be the dataset with $d$ dimensions and $D'$ a subset of $D$, i.e., $D' \subseteq D$. A K-means clustering on $D'$ yields the set of clusters $C = \{C_1, \ldots, C_k, \ldots, C_K\}$, where $c_k$ is the centroid of cluster $k$, $r_k$ is the number of points in $D'$ assigned to $c_k$ and $p$ is the number of free parameters, i.e., $p = (d + 1) \cdot K$. The BIC of a cluster $C_k$ using the Schwarz criterion is calculated as

$$\mathrm{BIC}(C_k) = \hat{l}_k(D') - \frac{p}{2} \log |D'|,$$

where $\hat{l}_k(D') = \sum_i \log \hat{P}(x_i)$ is the log-likelihood of $D'$ according to $C_k$. The point probabilities are computed as

$$\hat{P}(x_i) = \frac{r_{(i)}}{|D'|} \cdot \frac{1}{\sqrt{2\pi}\,\hat{\sigma}^{d}} \exp\left(-\frac{1}{2\hat{\sigma}^{2}} \lVert x_i - c_{(i)} \rVert^{2}\right),$$

where $c_{(i)}$ is the centroid closest to point $x_i$, $r_{(i)}$ is the number of points assigned to that centroid, and the maximum likelihood estimate for the variance (under the identical spherical Gaussian assumption) is

$$\hat{\sigma}^{2} = \frac{1}{|D'| - K} \sum_i \lVert x_i - c_{(i)} \rVert^{2}.$$

The X-means algorithm consists of 4 steps: (1) First, an initial K-means is run with $K = K_{lower}$. (2) Then each centroid is bisected into two children, which are placed in opposite directions along a randomly chosen vector. (3) A "local" 2-means clustering is run for each pair of children and a BIC score is assigned to this new partitioning. (4) If the BIC score increases with bisection of a centroid, the respective child centroids are kept; otherwise, the parent centroid is kept. Steps (2)-(4) are repeated until no centroid's BIC score can be improved by bisection. We used the R implementation of the X-means algorithm by Ishioka 34 and set $K_{lower}$ to 2. Clusters were numbered arbitrarily (cluster 1, cluster 2, etc.).
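The BIC comparison in step (4) can be sketched in a few lines of Python. This is not the R implementation used in the study. It is a minimal illustration that assumes hard cluster assignments, the standard spherical Gaussian density with a single pooled per-dimension variance, and p = (d + 1) · K free parameters; constants may therefore differ from the formulation in the X-means paper. The names `bic` and `keep_children` and the toy points are hypothetical.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def bic(clusters):
    """BIC (higher is better) of a hard partition under identical spherical Gaussians."""
    n = sum(len(c) for c in clusters)
    k = len(clusters)
    d = len(clusters[0][0])
    sse = 0.0
    for c in clusters:
        m = centroid(c)
        sse += sum(sq_dist(x, m) for x in c)
    if n <= k or sse == 0:
        raise ValueError("variance estimate is undefined for this partition")
    variance = sse / (d * (n - k))  # pooled per-dimension variance (assumption)
    log_lik = sum(len(c) * math.log(len(c) / n) for c in clusters)  # mixing weights
    log_lik -= 0.5 * n * d * math.log(2 * math.pi * variance)       # normalising constant
    log_lik -= sse / (2 * variance)                                 # squared distances
    p = (d + 1) * k                                                 # free parameters
    return log_lik - 0.5 * p * math.log(n)

def keep_children(child_a, child_b):
    """Step (4): keep the two children only if bisection increases the BIC."""
    return bic([child_a, child_b]) > bic([child_a + child_b])

# Two well-separated toy groups: bisection should be accepted.
left = [(0, 0), (0, 1), (1, 0), (1, 1)]
right = [(10, 10), (10, 11), (11, 10), (11, 11)]
split_far = keep_children(left, right)

# One compact group cut in half: the parent centroid should be kept.
split_near = keep_children([(0, 0), (0, 1)], [(1, 0), (1, 1)])
```

In the first case the gain in likelihood outweighs the penalty for the additional parameters; in the second it does not, so the parent is retained.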
Cluster visualisation. Visualising clusters in high-dimensional data is challenging. The popular scatterplot matrices (SPLOMs) and their extensions intuitively represent the relation between pairs of features as a matrix in which each non-diagonal element is a two-dimensional scatterplot 35,36 . However, the number of scatterplots grows quadratically with increasing dimensionality, which leads to scalability problems such as overplotting. Hence, advanced visualisation techniques have been proposed as a remedy, e.g. density contours, hexagon binning, colouring, transparency, layers showing aggregated geometric characteristics (minimal spanning trees, alpha shape, convex hull), animation, or combinations of multiple techniques such as splatterplots 37 . Still, SPLOMs and other traditional visualisation techniques such as parallel coordinate charts 38 are better suited to low-dimensional data with only a handful of features.
If the original data cannot be adequately displayed in low-dimensional projections, dimensionality reduction (DR) is often applied as a preprocessing step before visualisation. DR algorithms map a high-dimensional feature space onto a low-dimensional (often 2D) projection. Ideally, the projection preserves important structures of the original data, such as clusters, outliers and correlations. Principal component analysis (PCA) is a frequently used DR algorithm that generates a new coordinate system with orthogonal dimensions 39 . The new dimensions (principal components, PCs) are linear combinations of the original dimensions and are sorted by the variance they explain. Each PC is associated with a proportion of explained variance that characterises how much variability of the data it captures. PCA is primarily suited for normally distributed data; it is sensitive to outliers and incapable of capturing non-linear relationships. Another popular DR algorithm is multi-dimensional scaling (MDS), which puts emphasis on preserving distances: points that are close in high-dimensional space should also be close in low-dimensional space. For complex (arbitrarily shaped) structures, large distances are not meaningful because of the curse of dimensionality, so results may be unsatisfactory. t-distributed stochastic neighbour embedding (t-SNE) is a non-linear DR technique that visualises a matrix of pairwise similarities 40 . The similarities are calculated so as to preserve both global structures (clusters at different scales) and local structures (distances and neighbours). t-SNE does not allow for interpretation of the original dimensions. Further, the technique does not support adding a new observation to an existing projection without recalculation.
DR techniques are not suitable here: even if the clustering structure is preserved in the data projection, the semantics of the original features are lost. Discussions with domain experts led to the following requirements for a cluster visualisation:
• preservation of original features,
• intuitive cluster representation for multi-variate data with dozens of features,
• compact, at-a-glance comparison of multiple clusters,
• a chart design that allows a cluster to be juxtaposed with the overall patient mean.
We therefore introduce a new radial bar chart visualisation as a graphical representation of a single cluster, enriched with dedicated elements that satisfy the aforementioned requirements. In particular, the height of a bar depicts the average value of a feature over the patients assigned to that cluster. The radial layout distributes the bars around a circle, where each bar starts at the black 0 line, which represents the feature average over all patients included in this study. Due to the scaling (z-score normalisation) of the features, bars pointing outwards represent feature averages above the overall patient mean and bars pointing inwards represent feature averages below the overall patient mean. This interpretation is visually supported by colour-coded bars using a gradient from dark blue (low burden) to yellow (mean burden) to bright red (high burden). Feature names are shown on top of each bar. All values are expressed in standard deviations from the mean. For example, a value of −1 indicates that the cluster average is 1 standard deviation below the overall patient average. Intra-cluster standard deviations are represented as grey error lines facing the coloured inner circle. To facilitate quick feature localisation, features were grouped into categories which are displayed in the inner circle, alongside the cluster name and the number of patients assigned to that cluster. These categories were (in clockwise order): tinnitus characteristics, physical quality of life, experiences of pain, somatic expressions, affective symptoms, tinnitus-related distress, internal resources, perceived stress, and mental quality of life.
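How a bar's height and colour could be derived from standardised data is sketched below. This is an illustration, not the code behind the published charts. The RGB anchor values and all function names are assumptions, and the clamping range of ±1.5 standard deviations is taken from the colour scale described in the figure caption.

```python
from statistics import mean

def cluster_bar_heights(z_scores, labels):
    """Mean z-score of one feature per cluster, i.e. the height of each bar."""
    groups = {}
    for z, label in zip(z_scores, labels):
        groups.setdefault(label, []).append(z)
    return {label: mean(values) for label, values in groups.items()}

# Assumed RGB anchors for the colour gradient (not taken from the study).
DARK_BLUE, YELLOW, BRIGHT_RED = (0, 0, 139), (255, 255, 0), (255, 0, 0)

def interpolate(start, end, t):
    """Linear interpolation between two RGB colours for t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(start, end))

def bar_colour(height, limit=1.5):
    """Map a bar height (in SD units) to a colour, clamped to [-limit, limit]."""
    h = max(-limit, min(limit, height))
    if h < 0:
        return interpolate(YELLOW, DARK_BLUE, -h / limit)
    return interpolate(YELLOW, BRIGHT_RED, h / limit)

# Hypothetical standardised values of one feature for four patients in two clusters.
heights = cluster_bar_heights([-1.0, -0.5, 0.5, 1.0], ["PT 1", "PT 1", "PT 2", "PT 2"])
colours = {label: bar_colour(h) for label, h in heights.items()}
```

Because the features are z-scored over all patients, a height of 0 coincides with the overall patient mean, so the black 0 line and the yellow midpoint of the gradient carry the same meaning.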
To provide a graphical overview of all clusters at the same time, we designed a radar chart variant in which feature averages are represented as points instead of bars, which allows multiple clusters to be shown in one chart. Within each feature category, the points of a cluster are connected by line segments. Points and line segments are coloured by cluster.
Interactive components for cluster inspection. An interactive demo of the cluster solutions and the visualisations is available at https://unmnn.de/phs/app/. Radar charts were augmented with interactive components: by hovering over a bar or a feature label, additional cluster summaries and compact feature descriptions are shown as tooltips. Clicking on a feature invokes an additional chart which shows the (normalised) distribution of the selected feature stratified by cluster, and if selected, also after treatment. Continuous features are shown using semi-transparent boxplots placed on violin plot 41

Results
According to X-means, four clusters (referred to as phenotypes hereafter) represent an optimal solution for the given dataset. The radial bar charts in Fig. 1 visualise phenotype-specific averages for all features. Graphical summaries of phenotype value distributions for all features on their original scales are provided in Supplementary Fig. S1. The radar plot in Fig. 2 shows average scores for each variable and for all clusters. While the four phenotypes are clearly distinguishable with respect to the psychosomatic and somatic variables, the line segments for most features of the group "tinnitus characteristics" are close to the overall patient average. Phenotype 1 (PT 1) represents the largest subgroup with 697 out of 1228 patients (56.8%). This patient subgroup is characterised by ostensibly below-average symptom expression across tinnitus-related and broader psychosomatic symptom indices, including affective symptoms, perceived stress, tinnitus-related distress and somatic symptoms, as well as (above-average) quality of life and internal resources (Fig. 1). Given their help-seeking behaviour, presentation in clinic and wish to participate in multimodal treatment, it can be assumed that these patients do experience psychological distress but aim to present themselves as healthily as possible. We therefore label this phenotype "avoidant group". Patients in this subgroup show comparatively high levels of education and employment and low levels of leave of absence and psychotherapeutic treatment (Table 1). PT 2 comprised 173 patients (14.1%) who reported the highest emotional and somatic burden among all PTs (Fig. 1b). More specifically, PT 2 represents a patient subgroup with high psychosomatic comorbidity and is thus labelled "psychosomatic group". This patient subgroup shows high tinnitus burden alongside clinically relevant impairment across all affective indices including depression, anxiety, and perceived stress.
These affective symptoms appear to align with somatoform expressions of distress including somatic symptoms. Patients in this subgroup report severely reduced quality of life and reduced coping opportunities, with more pessimism and less experienced self-efficacy and optimism. This subgroup features a high proportion of patients who live alone, are unemployed or have an overall lower educational status. Patients in this cluster further appear to consult more doctors, take more leave of absence and use more psychotherapy. PT 2 patients reported the tinnitus sound to be audible in the entire head more often than the other groups. PT 3 contained 187 patients (15.2%) characterised by above-average scores of features measuring somatic complaints and near-average scores for affective symptoms (Fig. 1c). Since the pain-related scores of SF8_bodilyhealth* and SSKAL_painfrequency were similarly high as in PT 2, this patient subset was labelled "somatic group". PT 3 represented the oldest subgroup, with the largest proportion of female patients and the longest reported time since tinnitus onset. Unlike PT 3, PT 4 (n = 171; 13.9%) exhibited above-average scores for affective symptoms, components of quality of life and perceived stress (Fig. 1d), e.g. mental component summary score (SF8_mentalcomp*; 0.85) and anxious depression score (BSF_anxdepression; 0.79). Hence, we label PT 4 as "distress group". PT 4 is the youngest of the 4 subgroups, with the largest fraction of male patients (Table 1).

Discussion
In this study, we combined data-driven clustering with a novel visualisation to identify and display distinct phenotypes in a large sample of patients with chronic tinnitus. Patient data were extracted from self-report questionnaires prior to starting a multimodal treatment program. Our analysis suggests four phenotypes of patients with chronic tinnitus. PT 1 (avoidant group) represents a large proportion of patients. Apart from the tinnitus symptom, patients in this subgroup reported few other affective or psychosomatic symptoms, and the tinnitus is used as an index representation of experienced distress. Due to these patients' focused presentation ("everything is okay were it not for the tinnitus"), clinicians can easily be led to believe that other potential contributors to individual distress do not require assessment. However, clinical experience strongly suggests that a thorough assessment of broader psychosocial stressors is warranted in so far as it is feasible in clinical practice environments. The psychosocial resourcefulness of this subgroup enables patients to seek help quickly and in a solution-focused manner. Good tinnitus-specific counselling and individualised (online) therapy modules featuring audiological, psychological or relaxation procedures would possibly represent an adequate treatment strategy for this patient subgroup.
PT 2 (psychosomatic group) represents 14.1% of patients, who showed high tinnitus burden alongside clinically relevant impairment across all affective indices including depression, anxiety, and perceived stress. These affective symptoms appear to align with somatoform expressions of distress including physical complaints and somatic symptoms. Patients in this subgroup report severely reduced quality of life and reduced coping opportunities, with more pessimism and less experienced self-efficacy and optimism. A frequently asked question is whether increased tinnitus-related distress contributes to increases in depression or vice versa. In this group, we consider depressive or anxious symptoms to be a crucial underlying factor for general symptom burden, and treatment must begin with a focus on improving mood and relieving depression. Here, tinnitus-related distress needs to be seen within a broader context of medical and psychological influencing factors that require idiosyncratic conceptualisation. According to the socio-demographic variables, this patient subgroup features a higher proportion of women, and more patients who live alone, are unemployed or have an overall lower educational status. Patients in this cluster further appear to consult more doctors, take more leaves of absence and use more psychotherapy.
PT 3 (somatic group) appears to represent a patient subgroup that is characterised by somatopsychic symptom expressions, i.e., physical symptoms that may reflect distress and/or underlying medical conditions. To adequately address the needs of this patient subgroup, multimodal interventions might include a proportion of body-oriented procedures such as relaxation exercises or physiotherapy, whose effect, however, should be interpreted with regard to both direct and indirect psychological effects (e.g. through an increased sense of wellbeing or of being cared for by others).
The height of a bar shows a feature's z-score normalised within-cluster average, and the grey line centred at the top of the bar illustrates the 95% confidence interval. The colour of a bar represents the difference of the within-cluster average from the overall patient average (PA), from − 1.5 SD below PA (dark blue) to PA (yellow) and + 1.5 SD above PA (bright red). Features were grouped into 9 categories defined by tinnitus experts. The categories are shown within the inner circle. See subsection Features for a description of each questionnaire and the extracted features.

Scientific Reports | (2020) 10:16411 | https://doi.org/10.1038/s41598-020-73402-8
Patients in PT 4 (distress group) reported above-average perceived stress, accompanied by physical exhaustion and anxious-depressive mood. This group includes rather younger, employed patients (more men), who indicated chronic distress and are potentially susceptible to burnout syndrome with subjectively reduced mental capacity ("hamster wheel"), which is used as a description of the life situation even without tinnitus stress. In this subgroup, tinnitus might represent chronic stress associated with psychological vulnerabilities, environmental/work stressors and dysfunctional coping strategies. Multimodal therapy should initially focus on stress-regulation techniques, including relaxation or individually tailored behavioural modification approaches. Similar to the highly psychosomatically burdened PT 2, patients in PT 4 could also benefit from longer psychotherapeutic or multimodal treatment procedures (inpatient or rehabilitative).
The overview and juxtaposition of all clusters shows that some questionnaires and characteristics contribute strongly to differences among patient phenotypes. In particular, patient phenotypes differ substantially with respect to their coping attitudes, their stress and their perception of quality of life, as well as their tinnitus distress. Some of the questionnaire items separate well among some of the phenotypes, e.g. the items on perceived pain and complaints. In contrast, patients do not seem to differ in their perception of tinnitus. These contributions of the questionnaires to the phenotypes indicate that phenotyping may also be achievable with fewer questionnaires, especially because some of the questionnaires overlap.
Previous studies also employed clustering algorithms to identify tinnitus subtypes 13,14,16,17 . It is difficult to compare our findings to theirs because of the different set of available measurements: whereas the strength of our study was a large pool of self-report questionnaire data, Tyler et al. used both self-report data and audiometrics 14 , Schecklmann et al. used self-report data and cordiac imaging features 17 , and Langguth et al. used audiometric data 13 . Nevertheless, PT 2 (psychosomatic suffering group) appears to match the "constant distressing tinnitus" subgroup reported by Tyler et al. 14 , as average scores on features measuring tinnitus-related health burden were distinctly greater than in the other subgroups. Of course, the selection of meaningful features is pivotal for the efficacy of any cluster analysis. Schlee et al. argued against the usage of single-item features like visual analogue scale measurements because of their higher susceptibility to random measurement errors, lower test-retest reliability and higher vulnerability to unknown biases 18 . Due to the exploratory nature of our study, we decided to include both single-item measurements and compound scores. We assigned 11 out of 15 single-item measurements into the category "Tinnitus characteristics". Figure 2 (top-right) shows their low discriminative power, as all phenotypes' means are close to the population average, with the exception of TINSKAL_impairment and TINSKAL_loudness. Future research might focus on identifying a subset of key questionnaires alongside a simple computational tool that will enable clinicians to match individual patients with one or more of the here identified phenotypes.
Closest to our radial bar chart visualisation is the radar chart proposed by Schlee et al. 18 . Their solution indeed facilitates the comparison of two subgroups by comparing the areas of their associated polygons. However, there is still the potential problem of overplotting when one wishes to compare more than 2 subgroups, as we do. Hence, we chose not to fill the areas spanned by the connected points with colour, to avoid one polygon fully overlaying another. Further, since the main criterion for comparison is the shape of the polygons, inferences depend strongly on the ordering of features around the plot, which can be misleading. Schlee et al. 18 tackled this problem by computing an ordering that yields areas that achieve maximum mean surface difference between subgroups and minimum surface variance within subgroups. This approach is feasible for up to a moderate ( ≈ 20 ) number of features. In our study with 64 features, we decided to arrange features based on semantic categories, e.g. quality of life. This makes it easier to locate and track features. Further, our visualisation is not specific to tinnitus but could be used to present a compact visual summary of the characteristics of subgroups of any condition or index symptom.
Whether the visualisations will be adopted by clinicians for finding suitable tinnitus management strategies needs to be tested. Preliminarily, clinicians suggested that graphical summaries of possible patient subtypes may facilitate the allocation of modular treatment strategies to specific combinations of symptom presentations. A potential limitation of our analysis is the exclusion of patients who did not complete all questionnaires during admission. There are several reasons why a patient may not have completed all questionnaires, including unfamiliarity with the technical devices (the questionnaires must be filled in electronically), loss of motivation due to the relatively large number of questionnaires and collision with baseline examinations in the lab. The exclusion of these patients may have led to a selection bias. Nonetheless, our analysis over all 15 questionnaires allowed us to gain insights into the contribution of these questionnaires to phenotyping, so that eventually a reduction of questionnaires might become possible. Further, our analysis is a static snapshot of phenotypes at baseline. Hence, it might be sensitive to possible changes in tinnitus perception and associated health burden over time. It is possible that a patient will transition from one phenotype to another in later stages of her life or depending on her tinnitus management. Thus, a next step would be to study the effects of treatment on these phenotypes and to determine whether some patient phenotypes benefit more than others. Another limitation comes from the heuristic choice of the number of phenotypes. Many works use the number of clusters as an input parameter. Since this number is not known, we used the parameter-free X-means clustering algorithm.

Data availability
The datasets for this article are not publicly available because patients did not consent to publication of their data. Nevertheless, interested researchers can contact the directorate of the Tinnitus Center, Charité Universitätsmedizin Berlin, with data access requests addressed to the senior author (birgit.mazurek@charite.de).