A synthetic data generation system for myalgic encephalomyelitis/chronic fatigue syndrome questionnaires

Artificial intelligence and machine-learning models have proven useful for better understanding diseases across all areas of health science. Myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) lacks objective diagnostic tests; instead, validated questionnaires are used for diagnosis and for assessing disease progression. A sufficiently large database of these questionnaires facilitates research into new models that can predict profiles that help to understand the etiology of the disease. A synthetic data generator provides the scientific community with databases that preserve the statistical properties of the original, free of legal restrictions, for use in research and education. The initial databases came from the specialized unit of the Vall d'Hebron Hospital in Barcelona, Spain. 2522 patients diagnosed with ME/CFS were analyzed, and their answers to questionnaires related to the symptoms of this complex disease were used as training datasets. These data were fed to deep learning algorithms that provide models with high accuracy [0.69–0.81]. The final model takes SF-36 responses and returns responses for the HAD, SCL-90R, FIS8, FIS40, and PSQI questionnaires. A highly reliable and easy-to-use synthetic data generator is offered for research and educational use in this disease, for which there is currently no approved treatment.

plantar fasciitis, degenerative or mechanical vertebral disease, shoulder tendinopathy, multiple chemical sensitivity, epicondylitis, carpal tunnel syndrome, osteoporosis, hypercholesterolemia, hypertriglyceridemia, vascular risk, endometriosis, and thyroiditis, with a higher prevalence than that observed in subjects not affected by ME/CFS 6 .
In the study of ME/CFS, after the diagnosis and assessment of comorbid phenomena, it is essential to quantify and assess fatigue, quality of life, and anxiety/depression psychopathology using a battery of clinical self-administered questionnaires. Today there are few units specialized in ME/CFS in the world, with a relatively low number of duly documented cases and a lack of publicly available data compared with other disorders. Moreover, unfortunately, there are no commercially available diagnostic tests, no specific lab biomarkers, and no targeted FDA-approved drugs for ME/CFS 7 . Therefore, each subject to be diagnosed with ME/CFS must undergo a Fukuda criteria evaluation and the procedure that each unit has established using batteries of validated self-administered questionnaires. As stated before, it is important to evaluate the perception of disabling fatigue, sleep problems, and health-related quality of life using self-administered questionnaires such as the fatigue impact scales FIS40 8 and FIS8 9 , the PSQI 10 , the Short Form Health Survey (SF-36) 11 , the Symptom Checklist-90-Revised (SCL 90 R) psychological inventory 12 , and the hospital anxiety and depression scale (HAD) 13 . Ongoing placebo-controlled clinical trials evaluating the clinical benefits of drugs on ME/CFS symptoms 14 have used the change in some questionnaire scores from baseline to the end of the study as a primary endpoint.
There is no consensus on the number and type of questionnaires that should be administered, so not all units record the same number per subject. Consequently, it is difficult for ME/CFS units to gather the many questionnaire records necessary to efficiently approach large longitudinal and multicenter studies of patients with this pathology using the latest advances in data analysis, such as machine learning techniques.
Machine learning is a particular method of data analytics that automates model building. Over the last years, supervised machine learning algorithms have shown great performance in several clinical applications 15 for diagnosing and treating diseases. Supervised learning involves training algorithms on labeled input datasets; however, to be efficient and obtain optimal results, a large number of records is needed. Learning occurs by comparing results with the expected outputs to identify errors and adjusting the model's weights to infer knowledge. There are few publications, all of them very recent, that apply machine learning techniques to ME/CFS: seeking a new biomarker 16 , clustering 17 , or discovering the relationship between depression and ME/CFS 18 , using neural networks to seek omic biomarkers 16 or as classifiers 19 . While they all make important steps forward in understanding ME/CFS, their limited sample sizes make generalization and translation of their findings to clinical practice or other datasets difficult. Also, as stated before, when there are no clear biomarkers to follow the evolution of the illness, as in ME/CFS, quality-of-life questionnaires are used to measure it 14 . There are several other lines of investigation, such as clustering 20 or finding relations between blood measurements and questionnaire data 21 .
Therefore, there is an increasing demand for access to large repositories of high-quality health datasets for better and more reliable predictions from supervised machine learning algorithms. Anonymized electronic health records are bought and sold by insurance 22 and clinical groups 23 . However, they are limited in size or content, might be incomplete, and their applications might be restricted. This problem can be overcome using synthetic datasets coming from simulations 24,25 . Synthetic datasets are generated to increase the sample size of existing cohorts or to fill in missing values, preserving privacy while keeping the characteristics of the real data. Synthetic data generators preserve the statistical properties of the original data; however, they do not reveal any information about real people and offer several benefits, such as overcoming usage restrictions on real data arising from data sharing and patient consent. Synthetic datasets complement real-world data for various reasons 26 : ease of access, cost-efficiency, test-efficiency, patient privacy protection, completeness, validation capabilities, handling of missingness, modeling of complex interactions between variables, and sensitivity analysis with the latest classifiers, graphical modeling, and resampling 27 . A common application of synthetic data generation in medicine is the generation of images simulating diseases, which helps to test and benchmark the performance and accuracy of different algorithms. Some recent applications are the simulation of skin lesions 28 , brain atrophy in aging or dementia 29 , generation of PET MRI scans for Alzheimer's disease 30 , tumor generation in the brain 31 , and breast cancer 32 .
and Carruthers criteria 34 , and provided signed written informed consent; ethics committee approval was obtained. The data collected were anonymized in a database to which only those designated for the study had access, and in no case was any information known that could reveal or infer a participant's identity.

Relationship graph between questionnaires.
Graph theory was used to analyze the relationships between the subscales of each questionnaire. A graph is a collection of nodes (also called vertices) joined together in pairs by edges (undirected) or arcs (directed) 35 . The graph structure allows us to capture the pattern of interactions between the nodes (individuals or entities). Graph (or network) analysis is used to study relationships between individuals to discover knowledge about global and local structures. The study of the network structure helps to decide the optimal order 36 .
In this work, the graph nodes are defined as all subscales, and the edges as moderate or strong correlations between nodes (subscales). The linear correlation between two subscales i and j is denoted corr(i, j), and a Pearson correlation is considered moderate or strong if corr(i, j) ≥ 0.5 37 in the case of direct correlation. To also capture inverse correlations, an edge (i, j) is defined if |corr(i, j)| ≥ 0.5.
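As an illustration, the edge rule above can be sketched in a few lines: compute the Pearson correlation matrix of the subscale columns and connect every pair with |corr(i, j)| ≥ 0.5. The data and subscale names below are synthetic toy values, not the study's dataset.

```python
import numpy as np

def correlation_edges(data, names, threshold=0.5):
    """Build the subscale relationship graph: an edge (i, j) exists when
    |corr(i, j)| >= threshold (moderate or strong correlation)."""
    corr = np.corrcoef(data, rowvar=False)  # subscales are columns
    edges = []
    n = corr.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if abs(corr[i, j]) >= threshold:
                edges.append((names[i], names[j], round(float(corr[i, j]), 2)))
    return edges

# Toy example with three synthetic "subscales": a and b strongly related,
# c independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.3 * rng.normal(size=200)   # strongly correlated with a
c = rng.normal(size=200)             # unrelated
data = np.column_stack([a, b, c])
edges = correlation_edges(data, ["a", "b", "c"])
```

With these toy data the only edge found is (a, b), since corr(a, c) and corr(b, c) stay near zero.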
The correspondence between subscales and tests is given in Table 2. Each subscale has been classified according to the area for which it was defined, named its subject. Thirty-eight subscales, six tests, and twelve subjects form the dataset from which the relationship graph is built.
The study of the relationships mentioned above indicates the order in which to generate our machine learning models. The SF-36 is prevalent and will be used in our model as initial data. The rest of the order is given by the relationships between the different tests, so that those with a stronger relationship are consecutive in the model. The strength of a relationship is measured as the percentage of connections between the tests' nodes.
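The "percentage of connections" criterion can be made concrete as follows: for a pair of tests, count the fraction of one test's subscale nodes that touch at least one node of the other test. The membership map and edge list below are hypothetical, for illustration only.

```python
# Hypothetical subscale-to-test membership and edge list; in the paper the
# strength of a relation is the percentage of one test's nodes connected
# to the other test's nodes.
membership = {"sf1": "SF-36", "sf2": "SF-36", "h1": "HAD", "h2": "HAD", "p1": "PSQI"}
edges = [("sf1", "h1"), ("sf2", "h1"), ("sf1", "h2")]

def relation_strength(edges, membership, test_a, test_b):
    """Fraction of test_b's subscales linked to at least one test_a subscale."""
    linked = set()
    for u, v in edges:
        if membership[u] == test_a and membership[v] == test_b:
            linked.add(v)
        elif membership[v] == test_a and membership[u] == test_b:
            linked.add(u)
    total = sum(1 for t in membership.values() if t == test_b)
    return len(linked) / total

# Both of HAD's 2 nodes touch SF-36 -> strength 1.0; PSQI's node touches none -> 0.0.
strong = relation_strength(edges, membership, "SF-36", "HAD")
weak = relation_strength(edges, membership, "SF-36", "PSQI")
```

Ranking pairs by this strength then yields the cascade order: the most strongly connected test comes immediately after SF-36, and so on.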

Model architecture.
Real data from all six questionnaires are required to train and build the models. First, an input matrix of dimension n × f represents the validated answers of the patients, where n is the number of validated responses and f the number of questions; this is the first training dataset. The prediction target is a second questionnaire with the same n records and f1 questions, and the model must generate a predicted matrix of that dimension. The next step takes as input the initial matrix concatenated with the last predicted matrix, with the following questionnaire's response matrix as the prediction target, as shown in Fig. 1.
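The cascade described above can be sketched as follows. The per-questionnaire question counts and the stand-in predictor are illustrative assumptions (the paper trains XGBoost/DNN models per question); only the concatenation pattern is the point.

```python
import numpy as np

# Hypothetical question counts per questionnaire (illustrative, not the
# paper's exact figures) and the cascade order reported in the Results.
QUESTIONS = {"SF-36": 36, "HAD": 14, "SCL-90R": 90, "FIS8": 8, "FIS40": 40, "PSQI": 19}
ORDER = ["HAD", "SCL-90R", "FIS8", "FIS40", "PSQI"]

def dummy_predict(X, n_out):
    """Placeholder: a real model maps the concatenated answers to n_out answers."""
    return np.zeros((X.shape[0], n_out))

def cascade(sf36):
    X = sf36                                   # step 0: real SF-36 answers (n x 36)
    outputs = {}
    for q in ORDER:
        pred = dummy_predict(X, QUESTIONS[q])  # predict the next questionnaire
        outputs[q] = pred
        X = np.hstack([X, pred])               # concatenate for the next step
    return X, outputs

X, outs = cascade(np.zeros((5, 36)))
```

Each step widens the input: the SCL-90R model sees SF-36 plus the synthetic HAD answers, the FIS8 model sees those plus synthetic SCL-90R, and so on.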

Machine learning algorithms.
Both classification and regression models can be used. The goal is to produce 186 output dimensions, calculated step by step, and the output is compared with the real dataset to validate the model. The results are validated using Student's t-test. The strategy is to validate one questionnaire at a time; the next step concatenates that questionnaire's answer matrix to the input and feeds the result to the following model. Machine learning and deep learning algorithms were tested step by step. The validation system measures whether real and synthetic data come from the same population using Student's t statistics. The models tested were regressors and classifiers. A comparison between XGBoost and Deep Neural Networks (DNN) 38,39 shows that both models offer similar performance on structured data.
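A minimal sketch of the t-test validation step: the series compared are per-record answer sums for real versus synthetic data (the toy totals below are hypothetical numbers, and the equal-variance form of the statistic is assumed).

```python
import math
from statistics import mean, variance

def t_statistic(xs, ys):
    """Two-sample Student's t statistic (pooled variance). The paper validates
    each questionnaire by testing whether real and synthetic answer sums come
    from the same population; this is a minimal sketch of that comparison."""
    nx, ny = len(xs), len(ys)
    sp2 = ((nx - 1) * variance(xs) + (ny - 1) * variance(ys)) / (nx + ny - 2)
    return (mean(xs) - mean(ys)) / math.sqrt(sp2 * (1 / nx + 1 / ny))

# Toy "questionnaire totals" for four real and four synthetic records.
real_totals = [21.0, 25.0, 19.0, 23.0]
synthetic_totals = [22.0, 26.0, 20.0, 24.0]
t = t_statistic(real_totals, synthetic_totals)  # small |t|: similar populations
```

A |t| close to zero (here about 0.55) means no evidence that the synthetic totals come from a different population than the real ones; in practice the associated p-value would be compared against a significance threshold.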
Validation metrics. The F1-score can be interpreted as the harmonic mean of precision and recall, reaching its best value at one and its worst at zero; precision and recall contribute equally to it. The formula for the F1-score is F1 = 2TP / (2TP + FP + FN), where TP is the number of true positives, FN the number of false negatives, and FP the number of false positives. Better performance means lower FN and FP values, and better precision and recall mean a better F1 performance. In imbalanced data, an accuracy greater than the F1-score indicates that some labels perform poorly. (The strength of the relationship between tests, used to decide the modeling order, is computed as max(rel_i / rel_j) for each i, j ∈ [1, n] over the n tests, where rel_i is the number of nodes of test i related with nodes of test j and rel_j is the number of nodes of test j.)
Recall is defined by the ratio recall = TP / (TP + FN) 40 . Accuracy, given y as the sample labels and ŷ as the predicted labels, is defined as the fraction of correct predictions over all samples. The mean error is defined as the ratio of the overall predicted questionnaire value to the overall sample questionnaire value. The series used for the Student's t value is the sum of all answers for each questionnaire variable, for both predicted and sample data.
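The metrics above can be computed directly from the confusion counts; a short self-contained sketch:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def accuracy(y_true, y_pred):
    """Fraction of correct predictions over all samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Example: 8 true positives, 2 false positives, 2 false negatives.
p, r, f1 = precision_recall_f1(8, 2, 2)       # all equal 0.8 here
acc = accuracy([0, 1, 2, 1], [0, 1, 1, 1])    # 3 of 4 correct -> 0.75
```

When the classes are balanced, accuracy and F1 move together; a large gap between them is the imbalance symptom the text describes.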

Ethics approval.
The authors declare that the procedures followed were in accordance with the regulations of the responsible Clinical Research Ethics Committee and with those of the World Medical Association and the Declaration of Helsinki.

Results
Relationship across questionnaires. In our proposed model, an edge (i, j) is defined if |corr(i, j)| ≥ 0.5, which indicates moderate or strong direct or inverse correlation. 2370 registers were validated, and the Pearson correlation was analyzed for 38 questionnaire subscales. Grouping subscales by subject yields the network of nine nodes shown in Fig. 2. The mental, depression, and anxiety subjects are strongly correlated with the physical subjects. The SF-36 emotional subscales are related to the anxiety, depression, and mental subscales (SCL 90 R and HAD questionnaires). As can be seen, HAD and SCL 90 R are strongly correlated. Node size is proportional to node degree, i.e., the number of incident edges.
In the supplementary material, a second network takes the subscales themselves as nodes, with the same relationship as edges. The SF-36 subscales (green) have strong relationships with HAD (magenta), FIS8 (strong green), and FIS40 (red). SCL 90 R (brown) has a strong relationship with HAD. Furthermore, PSQI (blue) has no relationships except through the total PSQI value. The strength of a relationship is measured as the percentage of relationships between the tests' nodes. The initial test is SF-36, and its nodes have relationships with 100% of HAD's nodes (3 of 3) but only 25% of SCL 90 R's (4 of 12). HAD's nodes have a 100% relationship with SCL 90 R's nodes. SCL 90 R is related to the single FIS8 node, which in turn is related to all four FIS40 nodes. The last test, with few relations, is PSQI. Consequently, the order decided according to the aforementioned relationships is: HAD, SCL 90 R, FIS8, FIS40, and PSQI.

Best model selection. A comparison between the XGBoost Classifier and the XGBoost Regressor, using SF-36 as training data and HAD as target with 2321 validated registers, is provided in supplementary material Fig. 3. Hyperparameters define how the model works 41 . The parameters tuned were max_depth, gamma, reg_alpha, reg_lambda, colsample_bytree, min_child_weight, subsample, n_estimators, and eta; Hyperopt was used for hyperparameter tuning 41 . Both models must be trained for each question; therefore, 14 models have to be trained. The predicted value set is {0, 1, 2, 3} and the trained value set is {1, 2, 3}, where in both cases greater values indicate worse health status. Regressor predictions were rounded for comparison with the real data. Comparing regression and classification with XGBoost, the mean regression error is much higher than the classification error (32.50% vs. 3.16%); therefore, the regression model is discarded in the following analyses (the results are available in the supplementary material, Table S1). The total number of answers was 32,494 (2321 registers × 14 HAD questions), and "1" and "2" answers make up 67.25% of the total. The model tends to reduce the mean error, so it predicted 70% more "1" answers than in the real data and rarely predicted "3" (for more information, see Table S3 in the Supplementary Material).
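The tuning step can be illustrated with a minimal stand-in for Hyperopt: a seeded random search over the same parameter names listed above. The ranges and the toy objective are assumptions for illustration; a real run would minimize cross-validated error of a trained XGBoost model instead.

```python
import random

# Hypothetical search ranges over the parameters the paper tunes with Hyperopt.
SPACE = {
    "max_depth": lambda r: r.randint(3, 10),
    "eta": lambda r: r.uniform(0.01, 0.3),
    "gamma": lambda r: r.uniform(0.0, 5.0),
    "reg_alpha": lambda r: r.uniform(0.0, 1.0),
    "reg_lambda": lambda r: r.uniform(0.0, 1.0),
    "colsample_bytree": lambda r: r.uniform(0.5, 1.0),
    "min_child_weight": lambda r: r.randint(1, 10),
    "subsample": lambda r: r.uniform(0.5, 1.0),
    "n_estimators": lambda r: r.randint(50, 500),
}

def toy_objective(params):
    """Stand-in for validation error; a real run would train XGBoost here."""
    return abs(params["eta"] - 0.1) + abs(params["max_depth"] - 6) / 10

def random_search(n_trials=200, seed=42):
    r = random.Random(seed)
    best, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {k: draw(r) for k, draw in SPACE.items()}
        loss = toy_objective(params)
        if loss < best_loss:
            best, best_loss = params, loss
    return best, best_loss

best, loss = random_search()
```

Hyperopt's Tree-structured Parzen Estimator explores such a space more efficiently than random draws, but the interface is the same: a search space, an objective returning a loss, and the best configuration found.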
Imbalanced data occur when one or more class labels have a very high number of observations and others have far fewer. The main problem is increasing accurate predictions for the minority classes. To account for the skewed class distribution, classes are given different weights, so that errors on rare classes result in a larger penalty and a correspondingly larger update of the model coefficients. The model based on the Keras library is more flexible and, for each question, can take the imbalance of the data into account. The main difference of the Keras classifier model is the usage of the recall value, which helps to reduce the aforementioned problem with imbalanced data (for more information, see Table S3 in the Supplementary Material). For each class i, the weight is computed as weight_i = n / (classes × count_i), where n is the number of valid registers, classes is the number of classes, and count_i is the support of the ith class.
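The class-weight formula above (which matches the common "balanced" heuristic) can be sketched as follows; the toy labels are illustrative:

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight_i = n / (classes * count_i): rare answers get larger weights,
    penalizing the model more when it misclassifies them."""
    counts = Counter(labels)
    n, n_classes = len(labels), len(counts)
    return {c: n / (n_classes * k) for c, k in counts.items()}

# Toy HAD-style answers: "0" is rare, "1" dominates.
weights = balanced_class_weights([1, 1, 1, 2, 2, 0])
# -> class 0 weight 2.0, class 2 weight 1.0, class 1 weight ~0.67
```

A dictionary of this shape is what Keras accepts via the `class_weight` argument of `Model.fit`, so rare answer values contribute more to the loss.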
Results comparison for the first question of HAD (for more information, see Tables S1–S3 in the Supplementary Material): the answer "0" has a support of 66 (2.8%) and the answer "3" a support of 562 (24.21%). Minority label classes tend to be underrepresented, with a low recall rate, 0.00 in the first case. These biases produce worse-quality synthetic data for posterior analysis. Table 3 shows the results once corrected by the class-weight configuration in our model, which significantly improves the results for responses with low representation.

Model results.
Building the model needs five steps, as depicted in Fig. 3. The first step requires an SF-36 questionnaire input matrix with 3019 registers for which HAD responses were also available; the output is a synthetic HAD matrix. The SF-36 covers concepts such as limitations in usual role activities because of emotional problems, vitality (energy and fatigue), and general health perceptions, and is one of the most widely used and evaluated quality-of-life questionnaires 41 . The other five questionnaires used in this work complement most of the information about the quality of life of ME/CFS patients.
The questionnaires can be answered quickly and are regularly available in primary care and specialized medical consultations. Some applications offer automated analysis of the results, reporting essential information about the patient's health condition.
Graph theory has been used to decide the order of the modeling cascade. Although a deeper analysis of these relationships should be the subject of other, more specific work, here it informs the order used in our model. These relationships characterize our model, which will become more robust as more records are analyzed. Our dataset is unusually large for ME/CFS, which makes our models more robust.
Our synthetic dataset generator can also fill in missing data in real datasets for any of the other five questionnaires. Thus, ME/CFS clinical units holding SF-36 answers but missing other questionnaires could build a complete dataset.

Limitations
(1) Single-center trial. (2) The unit is a reference for diagnosing and treating ME/CFS, which may bias the sample towards more severe cases and longer evolution times than studies in primary care. (3) No information is available on parameters such as the results of the two-day ergometric test for assessing exercise intolerance, a neuropsychological battery for assessing cognitive impairment, or neurovegetative dysfunction, e.g., heart rate variability. (4) This is a prospective study with cross-sectional data collection, not a longitudinal study.

Conclusion
Synthetic patients can be simulated with models of ME/CFS questionnaire data and corresponding standards of care to produce risk-free, realistic synthetic healthcare records at scale. An open-source generator offers high-fidelity synthetic data for research and educational use, free of legal, privacy, security, and intellectual property restrictions.
The research protocols were approved by the Ethics Committee of the Vall d'Hebron University Hospital, under the first "Population-based Registry of Patients with Chronic Fatigue Syndrome", approved on 18/10/2006.

Figure 1 .
Figure 1. Modeling schema. The model requires a questionnaire answer matrix as input. The output is the other five questionnaires. Each questionnaire has a different number of questions and subscales; most subscales are calculated as a simple sum of question values. For example, the SF-36 input dimension is n × 36, where n is the number of patients who answered the SF-36 questionnaire. The output is n × 186, where 186 is the total number of answers across the other five questionnaires.

Figure 2 .
Figure 2. Subscales relationship graph. Node colors represent the test to which each node belongs. The percentage in the legend represents the number of nodes versus the total.
See Table 1 for the final records. Patients were eligible to participate if they were at least 18 years old, had a confirmed diagnosis of ME/CFS, and met the Fukuda

Table 1 .
Available data for each questionnaire and the questionnaires' characteristics. From left to right, the column titles mean: Registers: number of available forms. Questions: number of questions per questionnaire. Subscales: number of defined subscales. Total value: whether the questionnaire has a unique summary value. Answers' rank: possible answer values for each question.

Table 2 .
Subscales and subject definitions for the questionnaires. Test: each of the analyzed questionnaires. Subscales: every dimension defined in each questionnaire. Subject: area associated with each subscale.

Table 4 .
Final model result. Each question of each step needs different parameters, so 188 models have to be trained, each with its own parameters.
The second step requires an input matrix of SF-36 + HAD (synthetic data) and produces synthetic SCL 90 R responses, and so on. The results are detailed in Tables 4 and 5. Given SF-36 questionnaire data, the model can create synthetic responses for the other questionnaires, informing on the impact of fatigue, psychological phenomena, and sleep dysfunction. The lack of risk-free health data is an issue for ME/CFS hospital units and investigators. This open-source project offers a tool to generate risk-free synthetic data for the health IT and clinical community to use, experiment with, and extend. The quality achieved in the validation tests does not cover projects or research focused on clinical discovery. Synthetic data can be an alternative to ground truth when data access is restricted and an excellent alternative for machine learning training/testing datasets 26. The SF-36 includes one multi-item scale that assesses eight health concepts: (1) limitations in physical activities because of health problems; (2) limitations in social activities because of physical or emotional problems; (3) limitations in usual role activities because of physical health problems; (4) bodily pain; (5) general mental health (psychological distress and well-being); (6) limitations in usual role activities because of emotional problems; (7) vitality (energy and fatigue); and (8) general health perceptions.
Scientific Reports | (2023) 13:14256 | https://doi.org/10.1038/s41598-023-40364-6