Disaggregated data on age and sex for the first 250 days of the COVID-19 pandemic in Bucharest, Romania

Experts worldwide have constantly been calling for high-quality open-access epidemiological data, given the fast-evolving nature of the COVID-19 pandemic. Disaggregated high-level granularity records are still scant despite being essential to corroborate the effectiveness of virus containment measures and even vaccination strategies. We provide a complete dataset containing disaggregated epidemiological information about all the COVID-19 patients officially reported during the first 250 days of the COVID-19 pandemic in Bucharest (Romania). We give the sex, age, and the COVID-19 infection confirmation date for 46,440 individual cases, between March 7th and November 11th, 2020. Additionally, we provide context-wise information such as the stringency levels of the measures taken by the Romanian authorities. We procured the data from the local public health authorities and systemized it to respond to the urgent international need of comparing observational data collected from various populations. Our dataset may help understand COVID-19 transmission in highly dense urban communities, perform virus spreading simulations, ascertain the effects of non-pharmaceutical interventions, and craft better vaccination strategies.


Background & Summary
Since the onset of the pandemic, the volume of COVID-19 data made available to the public has been unprecedented 1. Yet, disaggregated data are still scarce, incomplete, sometimes contradictory, and offer little cross-country comparability. Various reports have substantiated the need for better data. The importance of information disaggregated by gender, age, geographic location, ethnicity, and other variables relevant in a national context has already been highlighted, especially for developing countries 2. Some scholars have posited that failing to acknowledge the importance of disaggregated data on gender and sex may result in significant inequalities in access to health services 3. Others have argued that the current lack of disaggregated COVID-19 data will widen the existing sex and gender data gaps, which in turn will increase gender disparities in the health and socioeconomic effects of the pandemic, with a negative impact on females 4. Various experts worldwide have advocated the need to produce and standardise age-disaggregated health data to improve usability and cross-country comparison 5. They have shown that failing to do so leads to misinterpretation of the patterns of SARS-CoV-2 transmission among and beyond the cohort of children, of the process of prioritizing vaccination, and of the inspection of secondary effects.
As of January 21st, 2022, Romania had the second-lowest vaccination rate in Europe (41% cumulative uptake of the full COVID-19 vaccination scheme) 6. On top of that, Romania is among the ten European countries with the highest death toll (3,106 per one million people) 7. Located in the southeast of Romania, Bucharest is the capital city and the primary economic agglomeration of the country. Services (the most impacted sector during the COVID-19 pandemic) represent the leading contributor to the local economy 8. With a total resident

Methods
The dataset comprises 46,440 COVID-19 patients officially confirmed and reported between March 7th and November 11th, 2020, in Bucharest (before the start of the vaccination campaign). These real-world data were procured from the Bucharest Public Health Department (the BPHD), Romanian Ministry of Health. To receive the data, our team sent an official request to the BPHD (i.e., Address No. 14870, August 18th, 2020) on behalf of the University of Bucharest. The BPHD provided the dataset based on approval No. NT3054E (August 28th, 2020) in response to our request. Our study received ethical approval (Decision No. 1, September 1st, 2020) from the Ethics Committee of the Department of Sociology (University of Bucharest). The data were anonymized by the BPHD before being transferred to us. The 46,440 observations (patients) correspond to an investigated time window of 250 days (March 7th to November 11th, 2020). Subsequently, we qualitatively analysed the non-pharmaceutical interventions (NPIs) advanced by the public health authorities in Romania 18 during the 250-day timeframe. Accounting for their level of severity, we derived five stringency phases. Each of the 250 days was nested in one of the five phases. Figure 1 describes the steps of producing the "COVID-19 in Bucharest" dataset (or, briefly, the dataset). For each of the 46,440 COVID-19 patients, we have records of their biological sex, age, area of residence or quarantine, as well as their official confirmation date. Each patient also has a unique identification code (patient_ID) assigned by the BPHD. These records are marked in red in Fig. 1. Information about the NPIs corresponding to the first 250-day time interval was retrieved from official press releases uploaded on the webpage of the Romanian Ministry of Internal Affairs 18 and from newsletters published in the "Press releases" section on the webpage of the Romanian Ministry of Public Health 19.
This information was expressed in the form of the "phase" variable (marked in blue in Fig. 1). Additionally, our team derived new variables from the original data records provided by the BPHD, i.e., observation number, age_groups, month, week, and 14-day interval. These variables are marked in green in Fig. 1. Our team performed various technical validation checks on the "COVID-19 in Bucharest" dataset (plausibility, completeness, and conformance checks). These are presented in greater detail in the "Technical validation" section. For brevity, we only mention here that during the technical validation stage, we compared the data records in our dataset to the information enclosed in other publicly available datasets 20-22.
Fig. 1. The sequence of steps taken to produce the dataset.
We used the severity degree of the NPIs to delineate the stringency phases. In Fig. 2, for informative and illustrative purposes, we represent the evolution of the COVID-19 cases longitudinally (between March 7th and November 11th, 2020) by major NPIs, stringency phases, sex, and age groups. The stringency phases are displayed in temporal order. The information displayed in the main content and Fig. 2 represents the authors' contribution.
Fig. 2. The evolution of COVID-19 confirmed cases between March 7th and November 11th, 2020, in Bucharest, Romania. We illustrate the new daily cases by sex (a) and age groups (b) while accounting for the stringency phases. The rendered information represents the authors' contribution.
Stringency "Phase 1" ranges between March 7th and March 15th, 2020, and corresponds to "low to moderate measures." Forty-seven COVID-19 cases are confirmed during this phase. The major NPIs until March 15th are: restriction of outdoor or indoor public and private events to 1,000 participants (March 8th); cessation of flights (March 9th), bus rides (March 10th), and railway travel (March 12th) to and from Italy; suspension of face-to-face classes (March 11th) in all pre-university level schools and some universities; and limitation of the number of participants in indoor cultural, scientific, religious, and sports activities to 50 (March 11th). "Phase 2", matching the most severe measures, corresponds to "the state of emergency in Romania" and ranges between March 16th and May 14th, 2020. One thousand five hundred fifty-one cases are confirmed between these dates. The preliminary NPIs adopted during the state of emergency are: an extension of the suspension of in-person classes; permission of only takeaway and delivery services in restaurants and shopping malls; closure of hotels and clubs; prohibition of indoor cultural, scientific, religious, and sports events; the restriction of outdoor personal events to 100 participants; and the cessation of flights to and from Spain. On March 24th, the military is deployed to help enforce a national lockdown. Movement outside the household is generally prohibited for non-essential purchases, with persons over 65 having their outdoor activity restricted, at first, to a two-hour interval (March 25th) and, then, to a three-hour interval (March 29th). Starting April 27th, non-essential movement outside the household is permitted for persons above the age of 65 both in the morning (7-11 am) and during the evening (7-10 pm).
"Phase 3", with moderate measures, ranges between May 15th and June 16th, 2020. Nine hundred thirty-two cases are confirmed during this interval. May 15th marks the onset of the first "state of alert", allowing hairdressing salons, barbershops, and dental clinics to reopen. Facial masks generally become mandatory in indoor public spaces, public transportation included. From June 1st, some of the movement restrictions are gradually lifted, outdoor concerts and cultural events are permitted, and restaurant terraces reopen. "Phase 4", with 18,499 confirmed cases, is the phase with the least stringent measures, ranging between June 17th and October 6th, 2020. From June 17th, shopping malls, public nurseries, and pre-schools are allowed to reopen. Pre-university level schools reopen on September 14th. Local elections are held on September 27th, with a participation rate of 37% of the 18+ Bucharest population 23. For the entire phase, facial masks remain mandatory. "Phase 5", with low to moderate measures, ranges between October 7th and November 11th, 2020. Authorities confirm 25,274 COVID-19 cases during this phase. As of October 7th, theatres, cinemas, and show venues are closed in Bucharest, while restaurants are restricted to outdoor service only. Opening hours of stores are limited, and a night curfew is imposed. Starting November 9th, all in-person classes are suspended. Masks are mandatory in both outdoor and indoor places, working spaces included. Public and private gatherings are cancelled. Table 1 displays the major NPIs by stringency phase and time interval.
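The nesting of each day into one of the five stringency phases can be illustrated with a short sketch. It is not the authors' procedure; it only encodes the phase date ranges stated in the text, and the function names are hypothetical.

```python
from datetime import date

# Phase boundaries as stated in the text (inclusive start and end dates).
PHASES = [
    (1, date(2020, 3, 7), date(2020, 3, 15)),
    (2, date(2020, 3, 16), date(2020, 5, 14)),
    (3, date(2020, 5, 15), date(2020, 6, 16)),
    (4, date(2020, 6, 17), date(2020, 10, 6)),
    (5, date(2020, 10, 7), date(2020, 11, 11)),
]


def stringency_phase(confirmation_date):
    """Return the phase number for a date, or None outside the 250-day window."""
    for phase, start, end in PHASES:
        if start <= confirmation_date <= end:
            return phase
    return None


def days_per_phase():
    """Count the calendar days covered by each phase (both endpoints included)."""
    return {phase: (end - start).days + 1 for phase, start, end in PHASES}
```

Summing the values returned by `days_per_phase()` gives 250, which matches the investigated time window. A record with a missing confirmation date cannot be assigned a phase this way and would have to stay missing.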

Data Records
We deposited a copy of the dataset (i.e., the Bucharest COVID-19 dataset) in the generalist repository figshare 24. Data are available in an Excel file format (.xlsx), facilitating import into various statistical software programs such as R, Python, SPSS, Stata, or SAS, or conversion to comma-separated value format (.csv). In the database, the rows designate unique individual observations (COVID-19 confirmed cases). Each observation is assigned a number from 1 to 46,440 (the total number of cases), in ascending order. Overall, the dataset provides information about all the officially confirmed individual COVID-19 cases in Bucharest, from the onset of the pandemic in the city, on March 7th, until November 11th, 2020. Specifically, we give information on the COVID-19 confirmation date and each patient's sex, age, and geographical (administrative) location (the administrative district in Bucharest). We also give the patients' ID codes as provided by the BPHD. These ID codes are helpful for joining (linking) this dataset to other available datasets 16,17. Additionally, we derive four new variables from the original information. Namely, based on its COVID-19 confirmation date, each observation is nested in a (calendar) month, week, 14-day interval, and stringency phase. We make the dataset available in English to increase its international usage by health professionals, policy-makers, scientists, and other interested parties. We use self-explanatory and straightforward variable labels and values for users' convenience. Also, we mark missing data by "NA" codes. We continue this section with a detailed description of all variables (data fields) available in the dataset (Table 2).
Table 1. GANTT table displaying the major Non-Pharmaceutical Interventions during the first 250 days of the COVID-19 pandemic in Bucharest. The information illustrated in the table represents the authors' contribution.
Suspension of face-to-face pre-university classes: Mar. 11th - Sep. 14th
Suspension of face-to-face pre-university classes: Nov. 9th - Nov. 11th
a November 11th is not the date until which a measure is effective, but it is the last day in our dataset.
b Movement outside the household is prohibited for non-essential purchases.
c Outdoor activity for persons over 65 is restricted to a two-hour interval.
d Outdoor activity for persons over 65 is limited to a three-hour interval.
e Non-essential movement outside the household is permitted for persons above the age of 65, both in the morning (7-11 am) and during the evening (7-10 pm).
f Facial masks and gloves are mandatory in public indoor spaces (public transportation included).
g Restriction of outdoor activities and events to 1,000 participants.
h Limitation of the number of participants in indoor cultural, scientific, religious, and sports activities to 50.
Observation_number. There are 46,440 unique data entries (observations or patients) in the dataset. Therefore, the variable takes values from 1 to 46,440. The observation numbers are ascendingly sorted. These numbers do not reflect epidemiological progression and should not be used for that purpose. Instead, this variable uniquely identifies each observation in the dataset.
Patient_ID. The BPHD assigned each COVID-19 patient a numerical code as part of the anonymization process. We provide these patient ID codes because they are useful for joining (linking) the dataset to other available datasets 16,17.
Sex. This represents the biological sex of the tested individuals (patients). The dataset comprises 24,696 female patients (53%) and 21,744 male patients (47%). We report no missing cases on this variable. The variable takes two values: "male" and "female".
Age. This variable captures the equivalent of age in completed years, with values ranging from 0 (less than one year of age) to 101. The average value is 43.4, the median is 43.0, and the most frequent value within the dataset is 52.0. The standard deviation is 17.7. Thirty-six cases have missing data (these represent less than 0.1% of all cases). In the dataset, we mark missing values by NA.
Age_groups. We recode the "age" variable into five-year age groups (brackets), with categories ranging from 0-4 y.o. to 85+ y.o. We derive a total of 18 groups. The category with the highest absolute frequency is the 40-44 y.o. group (namely, 5,650 positive cases correspond to this age group, accounting for roughly 12% of all data entries). Thirty-six patients have missing information (these represent less than 0.1% of all cases). In the dataset, we mark missing values by NA.
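The recoding of age into five-year brackets can be sketched as follows. This is an illustration only: the exact category labels used in the file are not given in the text, so the label format below (for example "40-44" and "85+") and the function name are assumptions.

```python
def age_group(age):
    """Map age in completed years to a five-year bracket; missing stays "NA"."""
    if age is None:
        return "NA"
    if age >= 85:
        # Open-ended top category, as described in the text.
        return "85+"
    lower = (age // 5) * 5
    return f"{lower}-{lower + 4}"


# One representative age per bracket: 0, 5, ..., 80 and 85 for the open category.
ALL_GROUPS = [age_group(a) for a in range(0, 90, 5)]
```

`ALL_GROUPS` contains 18 labels, in line with the number of groups reported above.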
District. This variable refers to the six administrative units (or sectors) forming the Municipality of Bucharest, each governed by a mayor. The six districts have a clockwise geographic arrangement. For instance, District 1 is located in the north, District 4 in the south, and District 6 in the west of the city (Fig. 3). The exact district is not available for 8,043 cases (that is, 17.3% of all the data entries). For these cases, we add a generic location, "Bucharest". The "district" marks the place of residence or quarantine for a specific individual case. Due to
Fig. 4. Statistically significant differences between the structure of COVID-19 confirmed cases in Bucharest and the structure of the resident population of Bucharest. We illustrate the significant differences by age groups and sex: females (a) and males (b). Brighter yellow colours designate high positive differences (more COVID-19 cases than expected when compared with the total population) and dark blue colours designate high negative differences (fewer COVID-19 cases than expected when compared with the total population).

Month. We derive this variable from the "confirmation_date" variable. We assign each "confirmation_date" to a "month" (each date is nested in a month). Consequently, we obtain nine months of investigation (March to November 2020). Two of these months are incomplete: March and November 2020. Further, October has the highest absolute frequency of observations (19,429 cases accounting for 41.8% of all cases). Also, in the dataset, we have 137 data entries with missing data that are marked by NA. The variable takes the following values: "March", "April", "May", "June", "July", "August", "September", "October", and "November".
Week. We derive this variable from the "confirmation_date" variable. We assign each "confirmation_date" to a "week" (each date is nested in a week). The variable takes as values the week number (e.g., w10, w11, w12, …, w46). Week counting starts from the beginning of the year (2020): the first week of 2020 runs from January 1st to January 4th. We consider that each week begins on Sunday. In our dataset, the pandemic onset in Bucharest is in week 10 (i.e., w10). The largest number of reported cases is in week 45 (6,105 observations accounting for 13% of all cases). Also, in the dataset, we have 137 data entries with missing data that are marked by NA.
14_day_interval. We derive this variable from the "confirmation_date" variable. We assign each "confirmation_date" to a 14-day time interval or two-week time window (each date is nested in a two-week time interval). We build this variable for potential analysis purposes. The variable takes as values pairs of week numbers (e.g., "w10_w11", "w12_w13", "w14_w15", "w16_w17", …, "w44_w45", "w46_w47"). Note that "w10_w11" and "w46_w47" (the beginning and the end of the time window) are incomplete. The minimum number of unique COVID-19 confirmed cases (i.e., 40, representing less than 0.1% of all observations) is in weeks 10-11. Also, the
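The week and 14-day interval labels described above can be reproduced with a short sketch. It assumes Sunday-based weeks in which the days before the first Sunday of the year form week 1, as the text states, and it assumes that two-week windows pair an even week with the following odd week, which is inferred from the example labels rather than stated. The function names are hypothetical.

```python
from datetime import date


def week_number(d):
    # %U counts Sunday-based weeks; days before the first Sunday are week 0,
    # so adding 1 makes January 1st to 4th, 2020 fall in week 1.
    return int(d.strftime("%U")) + 1


def week_label(d):
    """Return a label such as "w10" for a confirmation date."""
    return f"w{week_number(d)}"


def fortnight_label(d):
    """Return a label such as "w10_w11" (assumed even/odd pairing of weeks)."""
    week = week_number(d)
    start = week if week % 2 == 0 else week - 1
    return f"w{start}_w{start + 1}"
```

Under these assumptions, March 7th, 2020 maps to "w10" and "w10_w11", and November 11th, 2020 maps to "w46" and "w46_w47", which agrees with the first and last values listed above.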

Technical Validation
Before being transferred to our team, the data were collected, curated, and anonymized by the Bucharest Public Health Department (BPHD), Romanian Ministry of Health. The BPHD performed data anonymization solely to safeguard personal data, without affecting the reliability, correctness, or comprehensiveness of the data. The BPHD is a Romanian public authority mandated to develop public health policies and programs, devise preventive measures, and collect public health statistics. In this regard, the BPHD is a reliable official source of data. The authors do not have details of the collection of the epidemiological data and, therefore, cannot independently assess the reliability of the dataset acquired from the BPHD. After receiving the dataset from the BPHD, the authors performed additional checks to ensure its technical quality. Firstly, we implemented plausibility checks, looking for duplicate cases. We identified 118 duplicate cases (with identical values on all variables). These cases were removed from the dataset. Secondly, we performed completeness checks. Namely, we closely examined the variables in search of unavailable information. We detected less than 0.3% missing values in relation to two variables, i.e., confirmation_date and age (specifically, 137 and 36 cases, respectively). We did not employ any imputation technique for these missing observations. Instead, we marked the missing data with "NA". Thirdly, we executed conformance checks and compared the information embedded in our dataset to available alternative data. To that end, we deployed three comparisons. We compared the Bucharest COVID-19 dataset against the first 147 disaggregated records officially announced by the health authorities at the onset of the pandemic in Romania 12,22. For each of the 147 records, the authorities provided various pieces of information: the confirmation date, patients' age, sex, residence, probable contacts, and travel history.
A human-to-human network analysis of these first 147 records is available in the literature 12. Further, we compared our dataset to the total population of COVID-19 cases officially reported in Romania for the same time window (March 7th to November 11th, 2020). We illustrate the two time series in Table 3. Also, we compared our dataset to the most recently updated data structure of the resident population in Bucharest (as of July 1st, 2020) 20. Tables 4 and 5 illustrate the comparison between the Bucharest COVID-19 dataset and the structure of the Bucharest resident population. Finally, we compared the stringency phase variable from our dataset to the stringency index developed by the University of Oxford 25.
Additionally, we ran data type checks to ensure that the data entered had the correct data type. For instance, we examined whether the age variable contains only numerical values. Further, we ran a range check to verify whether the values taken by our variables fall within a predefined range, for example, whether the age variable takes a reasonable range of values. Finally, we performed a format check to ensure that our variables had the predefined format. For instance, we assessed whether the values taken by the confirmation_date variable are all stored in the same fixed format, i.e., MM/DD/YYYY.
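The kinds of checks described in this section (duplicates, data type, range, and format) can be illustrated with a minimal sketch. It is not the procedure the authors ran. The record layout (a dictionary of strings per case), the 0-120 age range, the choice to ignore the observation number when looking for duplicates, and all function names are assumptions made for illustration.

```python
import re
from datetime import datetime


def is_missing(value):
    """Missing data are marked "NA" in the dataset."""
    return value is None or value == "NA"


def check_record(record):
    """Return a list of problems found in one record (a dict of strings)."""
    problems = []
    age = record.get("age")
    if not is_missing(age):
        if not str(age).isdigit():
            problems.append("age: not a whole number")        # data type check
        elif not 0 <= int(age) <= 120:                        # assumed range
            problems.append("age: outside 0-120")             # range check
    conf = record.get("confirmation_date")
    if not is_missing(conf):
        # Format check: fixed-width MM/DD/YYYY that is also a real calendar date.
        fixed_width = re.fullmatch(r"\d{2}/\d{2}/\d{4}", conf) is not None
        try:
            datetime.strptime(conf, "%m/%d/%Y")
            valid_date = True
        except ValueError:
            valid_date = False
        if not (fixed_width and valid_date):
            problems.append("confirmation_date: not MM/DD/YYYY")
    return problems


def find_duplicates(records, ignore=("observation_number",)):
    """Return indices of records identical to an earlier one on compared fields."""
    seen, dupes = set(), []
    for i, rec in enumerate(records):
        key = tuple(sorted((k, v) for k, v in rec.items() if k not in ignore))
        if key in seen:
            dupes.append(i)
        else:
            seen.add(key)
    return dupes
```

Values marked "NA" are skipped rather than flagged, which mirrors the decision not to impute missing observations.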

Usage Notes
Our data records illustrate the COVID-19 prevalence in an urban community (Bucharest) for the first 250 days by providing a high-level granularity dataset. Specifically, the dataset comprises individual-level covariates, such as the age and sex of the officially confirmed patients, in a longitudinal (daily) fashion. We hope to contribute to the current international efforts of pooling disaggregated empirical evidence on the spread of COVID-19. Our dataset may prove a valuable instrument for public health experts, policy-makers, scientists, and even journalists interested in assessing and better understanding COVID-19 spread in urban communities, especially before the introduction of vaccines. For example, our data may inform the efforts of scientists to statistically model or simulate the spread of diseases, in general, and of respiratory viruses, in particular. Our disaggregated observations may assist public health experts in building a comprehensive picture of the epidemiological situation. Moreover, they may help scientists establish (confirm) causal inferences in virus circulation patterns and address potential problems, such as Simpson's paradox 26. Furthermore, this dataset is suitable for European or global comparisons as it comprises individual-level cases that allow for standardization.
If our dataset is supplemented with compatible information available from other sources 16,22,27, it may prove fruitful for guiding pharmaceutical interventions (e.g., vaccination efforts and strategies). For example, a subsample of this dataset has been used to estimate the role of age in spreading COVID-19 across a social network in Bucharest 16,17. The age and sex of the patients confirmed positive between August 1st and October 31st, 2020, were input into relational hyperevent models 28,29. Specifically, these two covariates were combined with network data (human-to-human transmission chains) to test for age and sex homophily effects 17. Additionally, the variables embedded in our dataset may also be used, as covariates, in conjunction with network data, for estimating Exponential Random Graph Models (ERGMs) 15,30.
Furthermore, the current dataset may prove its utility in assessing the impact of the NPIs advanced by the local authorities in Bucharest between March 7th and November 11th, 2020. Various insights concerning the effects of the NPIs may be inferred using the sex and age covariates. For instance, the information in Table 6 implies significantly higher shares of COVID-19 cases among females when the stringency levels of the NPIs are higher. Further, the evidence exhibited in Table 6 may support the very few previous studies 31 that claim the average age of COVID-19 patients decreases and stabilizes over time.
Our dataset may also provide a better understanding of the susceptibility to infection by biological sex. Already available scientific evidence has pointed to lower treatment efficiency 32, greater rates of hospitalization 33, higher probability of intensive treatment 34, and a higher risk of death for males 35-37. Still, the evidence is mixed when looking at confirmed cases. Earlier reports find a sex imbalance, with COVID-19 male patients having a greater risk of infection 38,39. However, more recent studies uncover no difference between males and females regarding susceptibility to infection 40,41.
We find in our dataset that, overall, significantly more females than males were confirmed with COVID-19 (χ² = 187.64, df = 1, p < 0.001). Approximately 53.2% of all confirmed cases were females. The high level of detail in our data shows how males and females were affected during the first 250 days of the pandemic in Bucharest. For illustrative purposes, we report the differences between the Bucharest COVID-19 dataset and the resident population of Bucharest, by sex and age groups, at a 14-day time interval (Fig. 4). For females, differences range from −6.8 to +10.6, while for males, they range from −6.3 to +7.3. In Fig. 4, bright yellow colours designate high positive differences (more COVID-19 cases than we would expect when compared with the total population), and dark blue colours designate high negative differences (fewer COVID-19 cases than we would expect when compared with the total population). Negative differences were found in the 0-19 age group and the 70+ age group, irrespective of sex and confirmation date. Furthermore, the dataset provides evidence for a disproportionate impact of COVID-19 by sex during the first few weeks of the pandemic. Throughout weeks 12 to 21, significantly high positive differences can be noticed for females aged between 40 and 54. More adult females tested positive during the state of emergency than we would expect given their share in the total resident population. The effect is not visible in the case of males. In sum, these results may reveal occupational segregation and, consequently, lend support to existing reports about the unbalanced composition of the global health workforce (with females representing about 70% 42).
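The overall sex comparison can be checked from the two counts reported in the "Sex" variable description (24,696 females and 21,744 males). The sketch below assumes a goodness-of-fit test against an equal split between the sexes; the text does not state which null hypothesis was used, so this is an assumption, and the function name is hypothetical.

```python
import math


def chi_square_equal_split(count_a, count_b):
    """Goodness-of-fit test of two counts against a 50/50 split (df = 1)."""
    expected = (count_a + count_b) / 2
    statistic = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    # For df = 1, the chi-square survival function equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Counts taken from the "Sex" variable description.
statistic, p_value = chi_square_equal_split(24696, 21744)
```

Under the equal-split assumption, the statistic comes out at roughly 187.65, close to the reported value, and the p-value is far below 0.001. A test against the sex structure of the resident population would use different expected counts and give a different statistic.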
Comparisons of the disaggregated COVID-19 data to population parameters are expected to have a critical role in designing health policies. The scientific community has already documented this need for demographically informed decisions, stressing the importance of interlinking the stringency and content of COVID-19 NPIs with key figures of the population 43. For instance, school closures and curfews for individuals aged 65+ are expected to produce different outcomes depending on the structure of the population of interest. Moreover, sex- and age-disaggregated data are expected to guide the crafting of vaccination strategies 44-46. Last but not least, our dataset provides location details for 83% of the cases (see the "district" variable), which, coupled with the "confirmation date" records, could provide a spatiotemporal image of the first 250 days of the COVID-19 pandemic.
In conclusion, we argue that the present disaggregated dataset can significantly improve the accuracy and effectiveness of NPIs, especially in countries with low vaccination rates. Moreover, we deem that the qualitative scale of stringency (Fig. 1, Table 1, and the corresponding main content) is sufficiently justified and detailed for future researchers to use, extract, or modify it for their own exploratory purposes. Also, the modelling of COVID-19 spread in this micro-context may be performed by combining our dataset with the detailed evidence available from the COVID-19 Border Accountability Project (COBAP) 47.

Code availability
No code was developed for this research.