Introduction

Headache disorders are the most common neurological symptoms and have a substantial impact on sufferers. A proper diagnosis of headaches is essential for its treatment. Currently, the diagnosis of headache disorders is highly dependent on self-report from patients and the interpretation of the self-report by clinicians. The International Classification of Headache Disorder (ICHD) was published to aid a standardized diagnosis of headache disorders1. The ICHD has three chapters including primary headache disorders, secondary headache disorders, and painful cranial neuralgias/facial pain. These chapters include a number of disorders and their subtypes. However, its clinical application may be challenging for physicians who are inexperienced in headache medicine.

There have been efforts to aid the diagnosis of primary headache disorders using neurophysiological tests2, neuroimaging3,4, and blood-based biomarkers5,6; however, these have not replaced clinical interviews. Previous studies have mainly focused on migraine with little focus on the differential diagnosis of other headache disorders7,8. Recently, a simple questionnaire for the screening of migraine was developed and validated for research purposes9. The clinical diagnosis of headache disorders should, however, be based on a holistic approach since a single characteristic cannot replace the proper diagnosis.

Recently, data-driven approaches using machine learning or deep learning have been tested in the medical field to avoid biases attributed to human factor10,11,12. These approaches have been used mostly for neuroimaging analysis in headache research3,4. In this study, we aimed to analyze self-reported symptoms of patients to classify four headache disorders including migraine, by using machine learning approaches. Real-world questionnaires obtained from more than 2000 patients were used for this study.

Methods

Subjects

This study was approved by the institutional review board (IRB) of the Samsung Medical Center (IRB 2018-10-029). Written informed consents were obtained from patients or their guardians. Our study was performed in full accordance with local IRB guidelines. A total of 2162 patients who visited our headache clinic for the first time between January 2017 and December 2018 were included in our prospective headache clinic registry. The registry was retrospectively screened for this study. All patients completed structured questionnaires. Based on the questionnaire and clinical interview, the diagnosis of headache disorders was made using the ICHD-3 beta or ICHD-3 (whichever was the most updated at the time of visit) by headache specialists (MJL with 10 years of experience and C-SC with 30 years of experience).

Headache clinic registry

The questionnaire for assessing patients on their first visit was developed by a headache specialist (MJL) and has been used in the Samsung Medical Center headache clinic since 2015. The questionnaire consists of 75 screening questions (Supplementary Table S1) including headache characteristics (e.g., intensity, location, nature of pain, and aggravation during or avoidance of physical activities), disease course (onset, the mode of onset, and the time of aggravation), associated symptoms (e.g., nausea, vomiting, photophobia, phonophobia, osmophobia, autonomic symptoms), aura, information regarding the medication used for headaches, past medical history (e.g. hypertension, diabetes, insomnia, depression, anxiety, and others), and social history (e.g. caffeine intake, smoking, alcohol consumption, and occupation). In addition to data from the questionnaire, patient demographics including age, sex, and body mass index (BMI) were prospectively recorded in the headache clinic registry and used for the analysis in this study. The ICHD-3-based diagnosis of each patient was coded in the registry.

The data of 2018 were used as the training cohort and the data of 2017 were used as the test cohort. There was no overlap between the two cohorts because the questionnaire was evaluated only for new patients. For the analysis, we merged headache disorders with similar entities into seven groups: migraine, tension-type headache (TTH), trigeminal autonomic cephalalgia (TAC), epicranial headache (including primary stabbing headache and occipital neuralgia), thunderclap headache ([TCH] including primary and secondary causes of TCH), other primary headache disorders, and secondary headaches other than those causing TCHs. Among these, we excluded other primary headaches (n = 49, training cohort) due to the high heterogeneity in the subtype. Secondary headache disorders other than TCH (n = 122, training cohort) were also excluded because this subtype presented with heterogeneous diseases that required diagnoses from the clinical course rather than headache characteristics. Data from 2162 patients, including the training cohort (n = 1286) and test cohort (n = 876), were finally used for this study. Further details of the patients are given in Table 1.

Table 1 Distribution of primary headache subtypes.

Stacked classifier model

We adopted a stacked model that consisted of four layers of binary XGBoost13 classifiers as shown in Fig. 1. Each layer of binary XGBoost classifier was used to classify subjects into two groups: the target subtype and the rest. We explored all possible orders of the stacked classifier and chose the order with the best accuracy in the training cohort. The first layer classified the most dominant subtype (i.e., migraine) and the rest (i.e., non-migraine). This enabled less challenging issues to be tackled first. The second layer classified between TTH and the rest (i.e., non-TTH). The third layer classified TAC and the rest (i.e., epicranial headaches and TCH). The final layer classified epicranial headaches and TCH. Our ordering of the classifier is similar to a multi-scale approach where one starts solving a large-scale problem before progressing on to small-scale problems.

Figure 1
figure 1

Structure of the stacked classifier model. TTH tension-type headache, TAC trigeminal autonomic cephalalgia.

Feature selection

Each patient assessment was turned into a long feature vector. Continuous variable responses were normalized with a value between − 1 and 1. Categorical variable responses were converted to a one-hot vector. Multi-hot encoding was adopted for some categorical questions with multiple responses. Thus, the assessment of 75 questions for each patient was transformed into features with 128 dimensions. We applied the least absolute shrinkage and selection operator (LASSO)14 in choosing a few important features for each stacked classifier layer. For example, the LASSO was used to select features that can distinguish between migraine from non-migraine subtypes in the first layer. The LASSO was applied using the stratified tenfold cross-validation. From the cross-validation, features that appeared at least three times out of the ten folds were chosen. These features were chosen as the set of stable features and the threshold of three was chosen to maximize the classifier performance on average in the left-out fold in the training cohort within the tenfold cross-validation. The final model was re-trained using the stable features from the entire training cohort.

Classifier performance

The selected stable features were used to train the stacked XGBoost classifier. The trained classifier was evaluated on the independent test cohort. Sensitivity, specificity, and accuracy were assessed to quantify the performance of the classifiers in both cohorts. The classifiers were also evaluated using minimum sensitivity and specificity among the subtypes, to provide summary statistics over the subtypes. A confusion matrix was also provided.

Comparison with other methods

To ensure the methods used in our study are well-suited in classifying headache subtypes, we compared our feature selection method (LASSO) with support vector machine recursive feature elimination (SVM-RFE)15 and minimum-redundancy maximum-relevancy (mRMR)16 approaches. The numbers of the selected features using mRMR and SVM-RFE for each classifier layer were fixed as those of LASSO. The selected features were fed into the stacked XGBoost classifier. We also compared XGBoost with other binary classifiers such as k-nearest neighbor (k-NN), support vector machine (SVM), and random forest in each of the stacked layers with features selected by LASSO.

Results

Selected features

The feature selection procedure led to 32, 19, 6, and 22 features that corresponded to the first, second, third, and fourth layers of the stacked classifier from the training cohort (Table 2), respectively. Table 2 showed selected features positively correlated with the corresponding target subtypes. The top three prominent features in the first layer (migraine vs. non-migraine) were mode of onset: gradual, female sex, and absence of lacrimation. The top three prominent features in the second layer (TTH vs. non-TTH) were mode of onset: gradual, nature of pain: vague/cloudy, and cognitive complaint during headache attack. The top three prominent features in the third layer (TAC vs. specific headache syndromes including epicranial headache and thunderclap headache) were headache attack during sleep, headache triggered by upset stomach, and conjunctival injection. The top three prominent features in the fourth layer (epicranial headache vs. thunderclap headache) were location: retroauricular, nature of pain: electric shock-like, and nature of pain: jabbing, assuming epicranial headache as the positive subtype in the specific headache syndromes classifier.

Table 2 Selected features from different layers of the classifier.

Classifier performances

The performances of the classifiers for both cohorts were given in Table 3 and the performance using the confusion matrices were in Tables 4 and 5. The stacked XGBoost classifier using the selected features attained an accuracy of 82%, sensitivity of 87%, 66%, 85%, 65%, and 64% for the five subtypes, and specificity of 94%, 54%, 58%, 63%, and 57% for the five subtypes in the training cohort. The baseline accuracy (i.e., assigning all cases as the dominant subtype) was 67%. The stacked XGBoost classifier using the selected features led to an accuracy of 81%, sensitivity of 88%, 69%, 65%, 53%, and 51% for the five subtypes, and specificity of 95%, 55%, 46%, 48%, and 51% for the five subtypes in the test cohort. The baseline accuracy (i.e., assigning all cases as the dominant subtype) was 68%. Our approach performed better (p value < 10−8) than the baseline naïve classifier in the test cohort using Fisher’s exact test.

Table 3 Classifier performance of both cohorts.
Table 4 Confusion matrix for the training cohort. The bold numbers in the main diagonal denote correctly classified subjects.
Table 5 Confusion matrix for the test cohort. The bold numbers in the main diagonal denote correctly classified subjects.

Comparison of feature selection methods

Our feature selection method (LASSO) was compared to SVM-RFE and mRMR approaches. The numbers of the selected features using mRMR and SVM-RFE for each classifier layer were fixed as those of LASSO. The selected features were fed into the stacked XGBoost classifier. An overall accuracy, minimum sensitivity, and minimum specificity of the stacked XGBoost classifier were evaluated in the test cohort. Table 6 showed that features obtained through LASSO led to the best performance in the test cohort.

Table 6 Comparison of the proposed method with other feature selection methods in the test cohort in terms of classifier performance. The bold values indicate the highest score in each performance metric.

Comparison of binary classifiers

We compared XGBoost with k-NN, SVM, and random forest classifiers in each of the stacked layers in terms of overall accuracy, minimum sensitivity, and minimum specificity. The evaluation was performed in the test cohort (Table 7). The same features chosen from the feature selection stage using LASSO were used for all the classifiers. Although there were small differences in classifier performances, XGBoost still outperformed the other classifiers.

Table 7 Comparison of the proposed method with other classifiers in the test cohort. The bold values indicate the highest score in each performance metric.

Discussion

In this study, we applied a machine learning approach to classify major headache disorders using questionnaires completed by patients in a real-world setting. We found that machine learning is applicable in analyzing questionnaires. The performance of the machine learning approach in the classification of migraine was excellent however, its accuracy in classifying headache disorders other than migraine was inferior to that in classifying migraine. Nonetheless, our automated classification results could be still meaningful as the gold standard for the diagnosis of headache is a manual skillful application of the current classification criteria (currently ICHD-3, published in 2018). In the era of ICHD-3, there have been no studies evaluating the reliability and accuracy of the diagnosis of primary headache disorders made by primary care providers or general non-headache neurologists. Furthermore, there has been no classification methods other than ICHD-3.

Our study is one of the first studies to apply machine learning in the analysis of patient-reported questionnaires to classify primary headache disorders7. The diagnosis of headache disorders requires a skillful interview with patients and a comprehensive decision algorithm. We tested whether machine learning can substitute the role of the clinical interview. However, the samples of each headache disorder other than migraine and TTH were insufficient for the training. Headache disorders or syndromes other than migraine and TTH were merged into broader categories such as epicranial headaches or TCHs, which was not ideal for the detailed classification of second- or third-digit ICHD codes. In addition, secondary headaches other than those causing TCHs were excluded from the analysis since they cannot be incorporated into one entity. Secondary headaches should be diagnosed by clinical courses and causative workups rather than headache features. Taken together, our approach could not replace physician-based diagnosis due to insufficient results. However, this study demonstrated the feasibility of developing a better algorithm-based automated classification for headache disorders. Besides, our results might be used to inform or assist physicians by pre-screening with the most important factors of the stacked classifier (i.e., Table 2) or increasing the accuracy of less-specialized providers.

Our approach adopted a stacked XGBoost classifier that resulted in an overall accuracy of 81%, sensitivity and specificity of over 87% in the diagnosis of migraine. Our results were superior to the results from a previous study in which more selective data were used7. Existing studies on the classification of headache disorders with machine learning have focused on a few selected headache disorders such as migraine and tension-type headache due to challenges with sample size7,8. Previous studies used the random forest for classification however, our study adopted the XGBoost. XGBoost belongs to the boosting classifiers in which both the variance and bias of the classifier is reduced, while random forest belongs to the bagging classifiers in which only the variance of the classifier is reduced13. XGBoost has shown improved performance in many recent machine learning challenges where high-dimensional features were involved. The performance of XGBoost in classifying migraine was superior in our study because migraine is characterized by diverse features which cannot be fully incorporated in conventional statistical models, due to the complexity and challenge of multiple testing. Manual analysis even by human experts, may be time-consuming and prone to errors. However, with the automated classification algorithm suggested by this study, multiple features of headache disorders can be systematically identified. This automated classification algorithm is thus time efficient and could minimize human error in the diagnosis of headache disorders.

Our stacked classification model well reflected features of each headache disorder. Top three features used in our classification model show insights into each headache disorder when compared to the ICHD-3 criteria1. First, the mode of onset was important in migraine, TTH, and epicranial vs. TCH classifiers. This important feature should be always considered in the differential diagnosis of secondary and primary headaches, but it has not been listed in the ICHD-3 criteria for migraine, TTH, and epicranial headaches1. While migraine and TTH are typical examples of gradual-onset headaches, thunderclap onset is the most important syndrome-defining features of TCH as its nomenclature implies. For TAC, the mode of onset was not included in the classifier, as most patients with TAC experiences a relatively rapid evolution of headache attack. Second, the demographic feature was also important, while the ICHD only deals with headache characteristics. For example, female sex was ranked as the second important feature of classifying migraine. This may suggest that the female predominance is more robust in migraine than in other primary headache disorders at least in clinic-based samples. Third, the nature of pain was important in TTH and epicranial vs. TCH classifiers: vague and/or cloudy nature of pain for TTH and electric-shock like and jabbing natures for epicranial headaches. These features well reflect the nature of corresponding headache disorders, although they are different from features listed in ICHD-3 criteria1. The ICHD-3 denotes pressing or tightening quality of pain as features of TTH and stabbing, shooting, or sharp quality of pain as epicranial (primary stabbing headache or occipital neuralgia) headaches1. However, these features may be less useful in the differential diagnosis as they can co-exist in migraine attacks, TACs, and even TCHs in the real world. Fourth, the presence or absence of autonomic symptoms was important in differentiating migraine and TACs. The ICHD-3 also denotes autonomic symptoms as characteristic features of TAC1. Although autonomic symptoms can accompany migraine attacks, they are less prominent when compared to those of TACs17. Finally, sleep-awakening hypnic attacks were important in the TAC classifier. The time of headache attack has not been included in the ICHD-3. However, most of the primary headaches other than TAC tend to regress during sleep. In summary, our data showed these features can have greater relative weights in the differential diagnosis between primary headache disorders even though they are not listed as or different from syndrome-defining features in the ICHD-31.

To apply our study results to clinical practice, it should be kept in mind that secondary headache disorders were excluded in this model. This may have some clinical implications: in addition to clinical history, biochemical, radiological, or sometimes histologic evaluations are needed to rule out secondary headache syndromes. Historically, the clinical course rather than headache characteristics has been more important, whilst this cannot be easily captured by the questionnaire. Still, we explored whether automated classification was possible using the same approach. The classification performance was unsatisfactory as shown in the Supplement.

Our study has some limitations. First, the results were derived from data from a single center. Thus, our results need to be validated in an independent cohort study. Second, we applied conventional machine learning approaches in this study. Deep learning could be thought of as a high degree-of-freedom extension of conventional machine learning which has significantly improved classification performance in many domains11,18. Deep learning could be certainly applied in headache research and we believe the autoencoder network could be effective. Autoencoder network is capable of handling high-dimensional features that are correlated and can also learn low-dimensional feature embedding that is robust to noise. The features used in headache were high-dimensional (e.g., 128 dimensions) and could have a substantial correlation among them due to how the features were designed in this study. We plan to pursue research in this direction in the future.

We presented a method to classify subtypes of primary headache by fusing four XGBoost classifiers in a stacked fashion. Each classifier captured important characteristics for the target subtype in a data-driven approach. Existing studies were insufficient as they only considered fewer subtypes and reported worse classification performance than ours. Thus, although our approach was effective for the migraine subtype only, we believe our study is a first step towards building a comprehensive computer-aided diagnosis model for headaches. The software code for this study is open and can be adopted by other researchers to foster novel machine learning research in the migraine field.