Background & Summary

Multiple Sclerosis (MS) is a chronic neuroinflammatory autoimmune disease of the central nervous system (CNS). In MS, nerve fibres in the CNS are affected, resulting in demyelination and axonal damage, leading to varying degrees of loss of functional capabilities1. MS is prevalent between the age group of 20–40 years and is more common in women than in men, with at least twice as many women affected. While the exact aetiology of MS remains unclear, an interplay between genetic predisposition and environmental factors could contribute to its observed prevalence2,3,4. Moreover, due to the combination of pathophysiology, treatment and natural history of MS, people with MS (PwMS) are four times more prone to contracting infections than the general population5.

Severe respiratory syndrome coronavirus-2 (SARS-CoV-2) is the virus responsible for the 2019 novel coronavirus disease (COVID-19), which has killed more than one million people worldwide6. Given the diverse spectrum of COVID-19 disease patterns, ranging from mild to severe7,8, and the reported immune deficiencies in PwMS9, there were concerns about whether taking immunosuppressive or immune modifying medications affects the course and outcome of COVID-19 in PwMS10. To address these concerns, a multi-stakeholder data-sharing initiative known as the COVID-19 and MS Global Data Sharing Initiative (GDSI) was set up. The GDSI was guided by the joint efforts of the Multiple Sclerosis International Federation (MSIF) (MSIF: https://www.msif.org/) and the Multiple Sclerosis Data Alliance (MSDA) (MSDA: http://msdataalliance.org), where they leveraged their expertise and network to enable various data partners to share their data in various ways, resulting in three different data streams. The main objective of the GDSI was to investigate the effect of immunosuppressants or immune-modifying medications on COVID-19 or its outcomes. Moreover, the initiative aimed to scale up the COVID-19 data collection efforts and provide the MS community with data-driven insights during the pandemic10.

First and foremost, the key areas of interest among completed and ongoing data collection efforts related to COVID-19 in MS were agreed upon. This was done via a global consensus-building teleconference, in which the variables related to COVID-19, COVID-19 severity, COVID-19 treatment, demographic information, MS history and severity, information on DMT use, comorbidities and certain lifestyle behaviours were identified as important to be included in the core dataset10,11.

Subsequently, the global MS community was engaged to share the documentation of the COVID-19 status of PwMS. MS registries and cohorts from different countries were included in the initiative, and they were asked to share their core dataset via a central platform provided by QMENTA (https://www.qmenta.com/). Additionally, via a fast data entry tool, PwMS and/or their clinicians could share data directly in the platform. The data was securely stored by QMENTA, which is a company that provided a cloud-based platform that facilitates the aggregation, standardisation, and visualisation of various types of data, particularly clinical and imaging data10,11.

Rationale for Release and Limitations of the Dataset

This study is focused on publishing the data that was collected via the fast data entry tool. For the purposes of this paper, the data collected through this tool will be hereafter referred to as “direct entry dataset”. The release of the GDSI direct entry dataset12 is an important step towards the reproducibility and transparency of the research within the domain of MS. The processing and acquisition of this dataset12 have been performed keeping the European General Data Protection Regulation (GDPR) framework in mind13 and therefore for the first time, we are making this dataset12 publicly available.

As of the 31st of December 2022, the GDSI has been closed, meaning the central platform is no longer active, and all data available in the central platform has been deleted. The data originally provided by the participating data partners (via a data stream different from the direct entry) is kept securely by the individual data partners and can be retrieved if necessary for future (research) opportunities. The main motivation to share the direct entry dataset12 under an open license is for it to remain accessible for research beyond the GDSI. In this way, the availability of the direct entry dataset12 can aid in advancing our understanding of the effects of COVID-19 on PwMS. Moreover, it can serve as a validation dataset for different analytical models. It is important to note that the direct entry dataset12 differs from the datasets used in the previously published work11 because, besides being collected using the fast data entry tool, it has also been further anonymised using K-anonymity and ℓ-diversity. The motivation to do that was to ensure the privacy protection of the individuals while maintaining the dataset’s continued usability. Another reason to share the dataset12 under an open license is to appreciate and acknowledge the time invested by PwMS and clinicians that contributed to the GDSI direct entry dataset. Further information on the anonymity and acquisition of the data can be found in the methodology section.

The set-up, both from a technical- as well as governance perspective, that was used within the GDSI was unique. It allowed rapid data sharing at global scale and delivered data-driven insights to urgent clinical questions. The different data science concepts that were introduced and implemented within the GDSI could be leveraged to address other urgent questions in need of data-driven insights. However, a knowledge gap remains between the experts in MS (understanding the clinical questions) and the data scientists (understanding the technical architecture). There is a need for education to allow clinical experts to better understand complex data science-related topics. We are convinced that this openly licensed direct entry dataset12 can support this objective. The MS Data Alliance is a global non-for-profit multi-stakeholder organisation that aspires to overcome the socio-technical challenges that arise when scaling-up real-world MS data. One of the strategic objectives within the MS Data Alliance Academy is to educate the MS community on data science related topics through workshops, online tutorials and fellowship programs allowing personalised mentoring and training. The MS Data Alliance aspires to develop and host a series of workshops explaining how the GDSI was set-up. To allow the participants of these workshops to gain hands-on experience, the direct entry dataset12 described in this paper serves as the perfect resource to allow reproducing the GDSI set-up. Increased understanding of how the GDSI was set-up and how data-sharing at scale was facilitated will accelerate the collaborative research that is urgently required to deliver data-driven insights.

Methods

The data for the GDSI was collected in three ways: (1) by individuals directly entering the data into a central platform via the fast data entry tool, (i.e., the direct entry dataset12) (2) by collecting de-identified data from participating MS registries and cohorts, which were regularly invited to upload their COVID-19 in MS core data, and (3) by acquiring aggregated results from certain registries that did not provide individual patient information. Since, the scope of this study only encompasses the direct entry dataset, the methodology section of this paper will discuss the data collection process, anonymization, and analysis methods pertaining specifically to the direct entry dataset. The details of the infrastructure of the overall GDSI for data collection and ethical approvals have been described previously10,11

In the context of the direct entry dataset12, our study followed the ethical guidelines and received approval from the ethics committee of Hasselt University to further anonymize and publish the data collected via the fast data entry tool in the public domain as long as the anonymity of the patients is fully respected (Comité voor Medische Ethiek UHasselt, CME2020/025 AMD3). After agreement upon the variables to be collected via the direct entry dataset12 as described in the previous section, a small cell risk assessment (SCRA) was performed upon the variables only by P-95 (P-95: https://www.p-95.com/). P-95 is a consulting company that was deemed a trusted third party by the ethical committee of Hasselt University and was tasked to minimise the risk of a patient’s identity disclosure. As a result, the variables were divided into three different categories: “direct identifiers”, “sensitive variables”, and “indirect identifiers”, also known as identifiers, sensitive attributes and non-sensitive or quasi-identifiers, respectively14. Direct identifiers are structurally unique variables that can directly identify a person without extra information such as social security number, full name etc. Sensitive variables refer to the variables that have content that a respondent would not like to disclose, such as religion, political alignment, COVID-19 diagnosis etc. Consequently, indirect identifiers are non-structurally unique variables that can re-identify an individual if combined with other indirect identifiers within or from another dataset14.

Following that, entering data via the fast data entry tool was made possible on a global level to two distinct categories of users, namely clinicians and PwMS. The outreach to these groups included direct emails to clinicians working in the field of MS, engaging with MS patient societies, dissemination of information via newsletters and promotion of the tool and the intended use of the data acquired by the tool on the MSDA, MSIF websites and social media platforms. The participation and sharing of data via either data stream was on voluntary basis.

Upon accessing to the direct entry tool, a questionnaire was given to the users where they had to fill in the details in line with the variables10 decided in the core dataset. Additionally, while asking for predetermined variables, the questionnaire did not ask for any directly identifiable personal data such as name, social security number, address, etc., thus leading to the platform’s inability to identify individuals that uploaded their own or their patient’s data via the fast data entry tool. The tool was available globally and since at the time COVID-19 was a novel disease, the extensive geographical coverage made sense to ensure the gain of a high volume of data to gain insights. The data regarding specific healthcare facilities however, of which the respective respondents were part of, was not collected. The tool has been taken down since the 3rd of February, 2022.

Once the acquisition of the direct entry dataset12 was finalised, steps were taken on the indirect identifiers towards applying further de-identification strategies such as data generalisation and data suppression. The former replaces a data value with a less precise value by applying binning, categorisation or rounding, while the latter involves the removal of the entire column. Subsequently, privacy-preserving techniques such as K-anonymity and ℓ-diversity were applied as well to prevent any possible re-identification attacks in the future (see Fig. 1).

Fig. 1
figure 1

Anonymization Workflow for the Global Data Sharing Initiative Open Source Dataset. The workflow shows the anonymization process for the Global Data Sharing Initiative open source dataset. The dataset is first acquired, then aninymization methods are applied. The resulting anonymized dataset is then checked for bias using the chi-squared test independence. The final is an anonymized dataframe.

K-anonymity is a privacy-preserving technique that aims to make the data in indirect identifiers indistinguishable. This effectively means that at least “K” individuals in the dataset should share the same set of attributes of indirect identifiers. A dataset is thus said to have achieved K-anonymity if all combinations of values for indirect identifiers in the dataset are present in at least K different records15. For example, Fig. 2 shows a dataset that has achieved a K-anonymity of 2. This is because the combination of indirect identifiers such as sex, Age_in_cat and BMI_in_cat occurs at least two times.

Fig. 2
figure 2

Anonymised Dataset with Redacted Rows. The mock dataset shown is K-2 anonymised using the K anonymity method, where each combination of indirect identifiers must occur at least two times. Rows that do not meet this criterion are indicated by crossed-out red font and subsequently removed.

While K-anonymity can prove as an effective strategy for preventing identity disclosure, it nevertheless cannot ensure full protection of sensitive attributes, which can lead to semantic attacks, such as skewness and/or similarity attacks16. The ℓ-diversity model can potentially protect against these attacks. The ℓ-diversity model acts as a plugin on top of the K-anonymity model and focuses on diversity in sensitive attributes. ℓ-diversity is attained if each equivalence class of a K-anonymised dataset contains at least “ℓ” well-represented values for each sensitive attribute17. Figure 3 displays a dataset having K-anonymity and ℓ-diversity of 2. The dataset has at least two diverse values in the only sensitive variable, i.e., “covid19_confirmed_case”, for each combination of indirect identifiers. The rationale for choosing K-anonymity and ℓ-diversity stemmed from its previous use in literature18.

Fig. 3
figure 3

Anonymised and Diversified Dataset. The mock dataset shown is K-2 anonymised using the K anonymity method and l-2 diversified using the l-diversity technique. To ensure privacy, rows that do not meet the k-2 anonymity criterion are marked with a red crossed-out font, and rows that fail to satisfy the l-2 diversity requirement are market with a purple crossed-out font and subsequently eliminated. The dataset is required to have at least two occurrences of each combination of indirect identifiers and at least two classes in the sensitive variable column.

Upon the completion of the anonymisation of the dataset12, a bias assessment check was performed. Given the categorical composition of the indirect identifiers and sensitive variables, the relatively small sample size and the class imbalance in variables, a chi-squared test was conducted on the indirect identifiers and sensitive variables in the anonymised and the original dataset. Values of p < 0.05 are considered statistically significant. A SCRA was performed on the anonymised dataset12 by P-95 to ensure that the anonymised dataset is safe to be made public under an open licence. For analysis purposes, the anonymisation and statistical analysis of this dataset12 were performed using Python libraries such as matplotlib 3.6.019, pandas 1.5.320, NumPy 1.2421 and SciPy 1.022. The interface used was Jupyter notebook 5.023.

Data Records

In this section, a more detailed description of the variables included in the accessible dataset has been provided, which is available at PhysioNet and can be accessed via https://doi.org/10.13026/77ta-1866. This dataset12 consists of 1141 PwMS.

In terms of demographical characteristics, as shown in Tables 1, 77.8% of the cohort were within the age range of 18–50 years, with a distribution of 79.22% females and 20.77% males. The majority of the patients (75.02%) had a BMI below 30, and a significant portion (43.38%) were current or former smokers.

Table 1 Demographical characteristics of the cohort.

The health status, as detailed in Tables 2, 5.2% of the cases had a “confirmed” COVID-19 diagnosis, while “not_suspected” or “suspected” cases comprised 75.19% and 19.54%, respectively. Hospitalisation occurred in 5.25% of patients, and ICU admission was necessary for 13.67%. Among the hospitalised patients, a small proportion, 0.43%, required artificial ventilation. The dataset12 also indicates that 24.8% of the cohort reported having comorbidities, with 79.4% having relapsing-remitting MS (RRMS) and 9.11% having primary progressive MS (PMS). Additionally, 45.83% had an EDSS score between 0–6.

Table 2 COVID-19 and MS diagnosis and severity, and comorbidities.

Table 3 outlines the medication and treatment details. The most commonly used disease-modifying therapy (DMT) was dimethyl fumarate, utilized by 12.7% of the patients, followed by fingolimod (10.16%) and natalizumab (5.69%). The dataset12 also has missing values with the EDSS score and current or former smokers, among others, having 54.16% and 56.61% missing values, respectively.

Table 3 Disease modifying therapy information.

Variables Descriptions

  • secret_name: A unique identifier for the patient. “P_” or “C_” in the beginning indicates patient-reported and clinician-reported outcomes, respectively.

  • report_source: Indicates the source from which the data is acquired. The variable has two unique values: “clinicians” for clinician-reported and “patients” for patient-reported. Possible values: “Patients” (92.63%), “Clinicians” (7.36%).

  • age_in_cat: Indicates age in categories. Missing values for this field: 0.00%.

    • 0: if the age range is between 0 and <18.

    • 1: if the age range is between 18 and ≤50.

    • 2: if the age range is between 51 and ≤70.

    • 3: if the age range is 71 or greater.

  • bmi_in_cat2: This variable represents the body mass index (BMI) of the patient. BMI is a statistical index that can estimate body fat in people of any age by dividing a person’s weight in kilograms by the square of height in metres24. The unique values in bmi_in_cat2 are “not_overweight” (75.02%) and “overweight” (2.97%). The possible missing values are 21.99%.

    • “not_overweight”: if BMI ≤ 30 kg/m2.

    • “overweight”: if BMI > 30 kg/m2.

  • covid19_admission_hospital: Indicates the hospital admission status of the patient as a result of COVID-19. Has two unique values: “Yes” (1.31%) indicates admission in the hospital, and “No” (98.78%) indicates no admission. 0.00% missing values.

  • covid19_confirmed_case: Confirmed COVID-19 diagnosis of the patient. Has two unique values: “Yes” (5.25%) if the diagnosis is positive and “No” (94.74%) if it’s otherwise. 0.00% missing values.

  • covid19_diagnosis: It shows the perceived COVID-19 diagnosis of the patient. It has three unique values: “not_suspected” (75.19%), “suspected” (19.54%) and “confirmed” (5.25%). 0.00% missing values. “confirmed” refers to the condition if covid_19_confirmed_case has a value of yes. “suspected” if the patient had a suspicion of having covid-19 and any covid_19 symptoms such as sore throat, shortness of breath, pneumonia, fatigue, pain, nasal congestion and chills. “not_suspected” was assigned in case there was an absence of symptoms, suspicion of having COVID-19 or a confirmed negative diagnosis.

  • covid19_has_symptoms: Indicates the presence or absence of COVID-19 symptoms. Has two unique values: “Yes” (31.77%) if there are one or more symptoms associated with COVID-19, such as fever, dry cough, fatigue, pain, sore throat, shortness of breath, nasal congestion, loss of taste or smell and pneumonia. “No” (68.22%) if there are no symptoms. 0.71% possible missing values.

  • covid19_icu_stay: Indicates whether the patient stayed in the intensive care unit (ICU) of the hospital as a result of COVID-19. “Yes” (0.35%) if the person stayed in the hospital intensive care unit (ICU) and “No” (99.38%) if the person did not stay in the ICU of the hospital. 0.26% possible missing values.

  • covid19_self_isolation: Self-isolation status of the patient, advised either by the clinician or self-reported. “Yes” (39.87%) if the person self-isolated and “No” (58.8%) if the person did not self-isolate. Possible missing values are 1.31%.

  • covid19_sympt_chills: Presence of chills as a COVID-19 symptom. “Yes” (8.85%) indicates the presence of chills as a symptom, and “No” (12.88%) indicates otherwise. 78.26% possible missing values.

  • covid19_sympt_dry_cough: Presence of dry cough as a COVID-19 symptom. “Yes” (17.17%) if there are symptoms of dry cough and “No” (7.44%) if there are no symptoms of it. 75.37% possible missing values.

  • covid19_sympt_fatigue: Presence of fatigue as a COVID-19 symptom. “Yes” (20.94%) if there are symptoms of fatigue and “No” (4.55%) if there are no symptoms of it. 74.49% possible missing values.

  • covid19_sympt_fever: Presence of fever as a COVID-19 symptom. “Yes” (12.35%) if there are symptoms of fever and “No” (12.62%) if there are no symptoms of it. 75.02% possible missing values.

  • covid19_sympt_loss_smell_taste: Presence of loss of smell and taste as a COVID-19 symptom. “Yes” (7.17%) if there are symptoms of loss of smell and taste and “No” (13.75%) if there are no symptoms of loss of smell and taste. 75.37% possible missing values.

  • covid19_sympt_nasal_congestion: Presence of nasal congestion as a COVID-19 symptom. “Yes” (13.67%) if there are symptoms of nasal congestion and “No” (9.46%) if there are no symptoms of it. 76.86% possible missing values.

  • covid19_sympt_pain: Presence of pain as a COVID-19 symptom. “Yes” (16.3%) if there are symptoms of pain and “No” (7.88%) if there are no symptoms of it. 75.81% possible missing values.

  • covid19_sympt_pneumonia: Presence of pneumonia as a COVID-19 symptom. “Yes” (1.22%) if there are symptoms of pneumonia and “No” (18.75%) if there are no symptoms of it. 80.01% possible missing values.

  • covid19_sympt_shortness_breath: Presence of shortness of breath as a COVID-19 symptom. “Yes” (9.02%) if there are symptoms of shortness of breath and “No” (13.40%) if there are no symptoms of it. 77.56% possible missing values.

  • covid19_sympt_sore_throat: Presence of sore throat as a COVID-19 symptom. “Yes” (14.89%) if there are symptoms of sore throat and “No” (9.02%) if there are no symptoms of it. 76.07% possible missing values.

  • covid19_ventilation: Indicates whether the patient used a ventilator unit for ventilation during their hospital stay. “Yes” (0.35%) if ventilation was used, and “No” (99.21%) indicates otherwise. 0.43% possible missing values.

  • current_dmt: Indicates the status of the disease-modifying therapy (DMT) at the time of data entry of the patient. There are three unique values in the variable: “yes” (77.12%), “no” (8.85%) and “never_treated” (14.02%). 0.00% possible missing values. The values are measured as follows:

    • yes: if the person is currently on a DMT

    • no: if the person is not currently on a DMT

    • never_treated: if the person has never been on a DMT

  • dmt_glucocorticoid: Describes the status of intake of glucocorticoid. “Yes” (4.2%) if the person is taking glucocorticoid, and “no” (89.65%) states otherwise. 6.13% possible missing values.

  • edss_in_cat2: Indicates the category in which the Expanded Disability Status Scale (EDSS) lies. The EDSS is one of the most commonly used disability assessment tools25. The EDSS is an ordinal scale from 0 to 10 that runs in increments of 0.5, with 0 indicating normal neurological status and 10 indicating death due to MS. Even though it has a few shortcomings, such as low reliability and sensitivity to change, it is still one of the preferred scales25,26. The variables had two unique values: “1” (0.00%) and “0” (45.8%). 54.16% possible missing values. The values were calculated as follows:

    • 0: if the EDSS is between 0 and < = 6.

    • 1: if the EDSS is >6.

  • pregnancy: Pregnancy status of the patient. “Yes” (0.43%) if the person is pregnant and “No” (74.75%) if the person is not. Possible missing values are 24.80%.

  • sex: The biological sex of the patient. “Male” (20.77%) for males and “Female” (79.22%) for females. Possible missing values are 0.00%.

  • ms_type2: The type of Multiple Sclerosis phenotype. This variable has three unique values: “relapsing_remitting” (79.4%), “progressive_MS” (9.11%) and “other” (11.48%). Possible missing values are 0.00%. The values were calculated as follows:

    • relapsing_remitting: if the type of MS is relapsing-remitting MS (RRMS)

    • progressive_MS: if the type of MS is secondary progressive MS (SPMS) or primary progressive MS (PPMS)

    • other: if the type of MS is a clinically isolated syndrome (CIS) or empty or “not_sure” in case the patient or clinician was not sure.

  • current_or_former_smoker: Indicates the smoking status of the patient. “Yes” (43.38%) indicates whether a patient is a smoker and/or has been a former smoker. “No” (0.00%) indicates otherwise. Possible missing values are 0.00%.

  • dmt_type_overall: Indicates the specific type of DMT the person was on during data entry. The unique values in the variable are: “Currently on another drug not listed” (16.21%), “Currently on dimethyl fumarate” (12.7%), “Currently on fingolimod” (10.16%), “Currently not using any DMT” (8.85%), “Currently on interferon” (8.23%), “Currently on ocrelizumab” (7.36%), “Currently on natalizumab” (5.69%), “Currently on glatiramer” (5.6%), “Currently on teriflunomide” (5.52%), “Currently on cladribine” (3.06%), “Currently on rituximab” (1.31%), “Currently on alemtuzumab” (1.13%). Possible missing values are 14.11%. The values were calculated as follows:

    • “No information on DMT use”: if there is no information in the present or past history of DMT use by the person.

    • “currently not using any DMT”: if the person is currently not using any DMT but has used DMT in the past or has not used DMT at all.

    • “currently on interferon”: if the current DMT of the person is on interferon.

    • “currently on glatiramer”: if the current DMT of the person is on glatiramer.

    • “currently on natalizumab”: if the current DMT of the person is on natalizumab.

    • “currently on fingolimod”: if the current DMT of the person is on fingolimod.

    • “currently on dimethyl fumarate”: if the current DMT of a person is on fumarate.

    • “currently on teriflunomide”: if the current DMT of the person is on teriflunomide.

    • “currently on alemtuzumab”: if the current DMT of the person is on alemtuzumab.

    • “currently on cladribine”: if the current DMT of the person is on cladribine.

    • “currently on siponimod”: if the current DMT of a person is on siponimod.

    • “currently on rituximab”: if the current DMT of the person is on rituximab.

    • “currently on ocrelizumab”: if the current DMT of the person is on ocrelizumab.

    • “currently on another drug not listed”: if the person is on DMT other than the above-listed ones.

  • duration_treatment_cat: The duration of treatment of MS. The variable has two unique values: “0” (3.59%) and “1” (3.59%). Possible missing values are 80.54%. The unique values were calculated as follows:

    • 0: if the duration of treatment is less than 11 years.

    • 1: if the duration of treatment is 11 years or more.

  • stop_or_end_date_combined: Date in dd/mm/yyyy format indicating the stopping of DMT. Possible missing values are 28.13%.

  • covid19_outcome_levels_2: The outcome of COVID-19. The variable has three unique values: “0” (98.68%), “1” (0.87%), and “2” (0.43%). Possible missing values are 0.00%. The unique values were calculated as follows:

    • 0: If the person has COVID-19 but has not been hospitalised.

    • 1: The person has COVID-19 and has been hospitalised.

    • 2: The person has COVID-19, has been hospitalised, has been in the intensive care unit and/or was in a ventilation facility.

    • 3: The person died due to COVID-19 (not present in this dataset).

  • has_comorbidities: Indicates whether the person has any comorbidities. “Yes” (28.74%) indicates that there are comorbidities, and “No” (71.25%) indicates otherwise. Possible missing values are 0.00%.

  • com_cardiovascular_disease: Indicates the presence of cardiovascular comorbidities. “Yes” (1.13%) indicates that there are comorbidities, and “No” (22.08%) indicates otherwise. Possible missing values are 76.77%.

  • com_chronic_kidney_disease: Indicates whether the person has any chronic kidney disease. “Yes” (0.35%) indicates that there are chronic kidney comorbidities, and “No” (22.61%) indicates otherwise. Possible missing values are 77.03%.

  • com_chronic_liver_disease: Indicates whether the person has any chronic liver disease. “Yes” (0.87%) indicates that there are chronic liver comorbidities, and “No” (22.43%) indicates otherwise. Possible missing values are 76.68%.

  • com_diabetes: Indicates whether the person has any diabetes. “Yes” (1.48%) indicates that there is diabetes, and “No” (21.99%) indicates otherwise. Possible missing values are 76.51%.

  • com_hypertension: Indicates whether the person has hypertension. “Yes” (4.64%) indicates that there is hypertension, and “No” (19.63%) indicates otherwise. Possible missing values are 75.72%.

  • com_immunodeficiency: Indicates whether the person has an immunodeficiency. “Yes” (2.54%) indicates that there is immunodeficiency, and “No” (20.42%) indicates otherwise. Possible missing values are 77.03%.

  • com_lung_disease: Indicates whether the person has any lung disease. “Yes” (2.80%) indicates that there is a lung disease, and “No” (20.50%) indicates otherwise. Possible missing values are 76.68%.

  • com_malignancy: Indicates whether the person has any malignancy. “Yes” (1.05%) indicates that there is malignancy, and “No” (21.91%) indicates otherwise. Possible missing values are 77.03%.

  • com_neurological_neuromuscular: Indicates whether the person has any neurological and/or neuromuscular comorbidity. “Yes” (2.19%) indicates that there is neurological and/or neuromuscular comorbidity, and “No” (21.12%) indicates otherwise. Possible missing values are 76.68%.

  • comorbidities_other: Indicates names of other comorbidities that the patient might have that are not mentioned in the column names. Possible missing values are 79.66%

Technical Validation

The technical validation of this dataset12 was performed by adhering to a well-defined schema that facilitated the acquisition of data in a standardised manner. Furthermore, an automated data quality assessment pipeline was integrated during the data acquisition phase, which ensured the accuracy and reliability of the dataset12. For instance, constraints were placed on data entry, such as limiting the age range and only allowing ICU admission entries for people who were also documented as being admitted to the hospital.

The geographical data was suppressed during the de-identification process because including it would have compromised the level k-anonymity that we intended to achieve. Moreover, data generalisation was applied on the age variable to represent it in a categorical format. This helped in maintaining the anonymity of the dataset12 while retaining its utility. No direct identifiers were present in the dataset12. Additionally, the dataset12 contained the following sensitive attributes, i.e., “covid19_ventilation”, “com_cardiovascular_disease”, “com_chronic_liver_disease”, “com_chronic_kidney_disease”, “com_diabetes”, “com_hypertension”, “com_immunodeficiency”, “com_lung_disease”, “com_malignancy’, “com_neurological_neuromuscular” and two indirect identifiers such as “sex”, and “age_in_cat”. Based on the indirect identifiers and the sensitive attributes, the anonymised dataset12 achieved a K-anonymity of 60 and ℓ-diversity of 2.

After running the chi-square test (χ²) on the indirect identifiers and sensitive variables in the anonymised and the original dataset, the null hypothesis stating that there are no significant differences in the distributions was evaluated. The details of the analysis are displayed in Table 4. Initially, all the variables except “age_in_cat” exhibited a statistically significant change in distribution. This is because the “age_in_cat” in the original dataset had an extra category with 6 individuals. However, since anonymisation is a trade-off between privacy preservation and usefulness, and given the small sample size of the minor class in age_in_cat, that category was eventually removed. Therefore, when the adjusted “age_in_cat” with the same number of categories in both original and anonymised datasets12 were analysed, the p-value became insignificant, i.e., (χ²(1) = 0.0038, p = 0.95).

Table 4 Chi-Squared test results for the sensitive and indirect-identifiers of the anonymised dataset.

Thus, the analysis in Table 4 implied an insignificant change in the distribution of the variables in the original and anonymised dataset12 with respect to the indirect identifiers and sensitive attributes. While the absence of an insignificant change in the distribution does not necessarily imply an absence of bias, it does provide preliminary support for the utility of the anonymised dataset for further research purposes. Nevertheless, for certain specific categories, such as hospitalisation and intensive care unit (ICU) admission, the anonymised dataset12 was not able to fully capture the extreme minority cases. This discrepancy can be seen in Table 5.

Table 5 Original vs Anonymised: Differences in severe cases.

Usage Notes

Despite the potential benefits, there are certain limitations of this dataset12 as well. First, the dataset12 does not take into account false positive or false negative cases of COVID-19. Second, the size of this dataset12 is small and can thus impede the statistical power of the analysis performed on it, resulting in possible imprecise estimates of the association between MS and COVID-19. A striking example is the presence of only 5.2% (n = 60) PwMS in the dataset12 with confirmed COVID-19 diagnosis. This can limit the robustness of the dataset’s ability to provide reliable insights regarding the impact of COVID-19 on PwMS. Lastly, the dataset12 can have a potential selection bias. This pertains specifically to the cases reported by PwMS themselves as critically ill and deceased cases would most likely not be included in the dataset12. Furthermore, some of the rows were also eliminated during the application of privacy enhancement strategies such as K-anonymity and ℓ-diversity. These limitations should be taken into consideration when interpreting the results of any analysis conducted using this dataset12.