A structured open dataset of government interventions in response to COVID-19

In response to the COVID-19 pandemic, governments have implemented a wide range of non-pharmaceutical interventions (NPIs). Monitoring and documenting government strategies during the COVID-19 crisis is crucial to understand the progression of the epidemic. Following a content analysis strategy of existing public information sources, we developed a specific hierarchical coding scheme for NPIs. We generated a comprehensive structured dataset of government interventions and their respective timelines of implementation. To improve transparency and motivate a collaborative validation process, information sources are shared via an open library. We also provide code that enables users to visualise the dataset. The standardization and structure of the dataset facilitate inter-country comparison and the assessment of the impacts of different NPI categories on epidemic parameters, population health indicators, the economy, and human rights, among others. This dataset provides in-depth insight into government strategies and can be a valuable tool for developing relevant pandemic preparedness plans. We intend to further develop and update this dataset until the end of December 2020.


Background & Summary
Non-pharmaceutical interventions (NPIs), also known as public health and social measures (PHSM) 1, aim to prevent the introduction of infectious diseases (preparedness and readiness measures), control their spread and reduce their burden on the health system (control measures). The general concept of containing the initial (exponential) spread of a disease is called "flattening the (epi-)curve" 2. By reducing the growth rate of an epidemic, NPIs reduce the stress on the healthcare system and help gain time to develop and produce vaccines and specific medications, which is of utmost importance in the case of emerging infectious diseases.
During the COVID-19 pandemic, governments have enforced a broad spectrum of interventions under rapidly changing, unprecedented circumstances. Government responses to COVID-19 included the laissez-faire strategy, which implies doing little to nothing; the herd immunity strategy, which implies only a few measures or measures relying on voluntary compliance; and more aggressive approaches based on the implementation of a wide range of stringent NPIs, sometimes even limiting civil rights and liberties 3,4. Government control policies have diverged, in particular in the timeline of implementation and in the prioritization of the NPIs. In China, for example, quarantine, social distancing, cordon sanitaire, and isolation of cases have been associated with improvements in the key epidemiological markers, including the number of infections and COVID-19-related deaths 5. In Hong Kong and Taiwan, which experienced severe acute respiratory syndrome (SARS) epidemics in 2002-2003 6,7, early government actions, strict social distancing measures, contact tracing, extensive and proactive testing, and high compliance of the population have, to date, successfully mitigated the COVID-19 epidemic 8,9. Following a herd immunity approach similar to the one initially adopted by the UK government, the Swedish government did not introduce strict bans but formulated non-binding recommendations only (https://www.folkhalsomyndigheten.se/nyheter-och-press/). Predictive models, however, suggest that such a strategy might ultimately overwhelm the healthcare system 10.
Poor control policies have potentially dramatic repercussions on public health. Although the need for data on country-based responses to COVID-19 was urgent and is still crucial, there is a limited opportunity to capture this information. Started in mid-March 2020, our project aims to generate a comprehensive structured dataset on government responses to COVID-19, including the respective time schedules of their implementation.
In the context of the current COVID-19 health crisis, open knowledge 11 and data sharing are crucial to understand and help to mitigate the pandemic. In this article, we document and share the methodologies, tools and approaches used to produce the Complexity Science Hub COVID-19 Control Strategies List (CCCSL) dataset following the principles of open science. We provide a detailed description of the dataset and present examples of how it can provide insights into the global government response to COVID-19.
The dataset is readily usable for modelling and machine learning analyses and exhibits great analytical flexibility 12. In particular, researchers have leveraged the hierarchical structure and the granularity of the data to disentangle the individual impacts of the NPIs on the reduction of the effective reproduction number through a top-down approach (from theme to code). Results show that social distancing measures and travel restrictions, but also active risk communication, play a major role in containing the epidemic. The study further distinguishes the impact of different levels of implementation of some specific measures, e.g. those related to face covering 12.
Considering the imperative necessity for data on government interventions, we released version 1 of the dataset on 2 April 2020. Version 2, which displays a consolidated coding scheme, has been available since 7 May 2020. We also provide user-friendly documentation and materials (code, visualisation interface, and library of sources) along with the dataset, which help users understand the data and promote its use among non-experts. The dataset is not complete, and we continuously update it with newly available records. Depending on resources, updates are planned until the end of December 2020.

Methods
We used a content analysis 13-15 strategy of existing information sources to develop a hierarchical coding scheme specific to NPIs implemented to mitigate the burden of COVID-19. First, based on a literature review on community mitigation strategies and expert knowledge, eight themes (hereafter called level 1 (L1) in the coding scheme) were identified and labelled: (i) Case identification, contact tracing and related measures, (ii) Environmental measures, (iii) Healthcare and public health capacity, (iv) Resource allocation, (v) Risk communication, (vi) Social distancing, (vii) Travel restriction, and (viii) Returning to normal life. A definition for each theme is provided in the Online-only Table 1. At the start of our project, there were no previously published studies on NPIs against COVID-19 to be used as a reference for developing the labelling and coding scheme. Therefore, a list of NPIs that had already been implemented by different governments at that time (mid-March 2020) was compiled and served as a preliminary template to generate a priori categories within a hierarchical coding scheme. Strategies that could provide assistance to the population (e.g. related to financial support or food supply) or that may encourage compliance with the measures (e.g. resource allocations, risk communication) were also included. Listed interventions were then assigned to one of the eight themes defined above. The specific details and descriptions of each NPI were coded into a priori categories (hereafter called level 2 (L2) in the coding scheme), and into subsequent a priori subcategories and codes whenever needed (hereafter called level 3 (L3) and level 4 (L4) in the coding scheme, respectively). Discrepancies in code assignments were discussed within the coding team and were resolved by consensus.
www.nature.com/scientificdata
The objective of this hierarchical coding scheme for NPIs was to standardize the data collection and obtain a structured dataset that uses a consistent taxonomy, and therefore, promotes common understanding.
On 19 March 2020, we set up a platform for students, researchers, and volunteers to collect data on the NPIs implemented by the governments for preventing and limiting the spread of COVID-19, including the time schedules for the implementation. Data collectors received clear instructions on the objective of the project and indications on how to proceed for data collection. Data collectors were asked to use the template of a priori themes, categories, subcategories, and codes or to refer to the data curators if a measure could not be coded using this a priori coding system. Therefore, throughout the data collection process, new categories, subcategories, and codes emerged, derived directly from the text data sources. The emergent (inductive) categories and subcategories were openly coded by the data collectors or by the data curators. In a second step, inductive categories and subcategories were compared together and in relation to the entire dataset to detect co-occurrences (codes that partially or completely overlap) and redundancies. Codes with the same meaning were aggregated 16 . The categories and subcategories were tightened up to the point that maximized mutual exclusivity and exhaustiveness 15 . This resulted in a Master List of Codes (a list of all the codes that were developed and used in the study), including the curated a priori and inductive coding categories. The Master List replaced the a priori template for categorisation of the measures during data collection. It was shared with the data collectors via a Google spreadsheet and updated daily.
Different public sources were used to populate, update and curate the dataset, including official government sources, peer-reviewed and non-peer-reviewed scientific papers, webpages of public health institutions (World Health Organization, Centers for Disease Control and Prevention, and European Centre for Disease Prevention and Control), press releases, newspaper articles, and government communication through social media. We collected data on the following: (i) country, (ii) state/region (when measures were implemented at subnational-level), (iii) date of implementation of the measure, (iv) implemented measure coded following the four-level classification scheme described above (theme, category, subcategory and code), and (v) source. For each country, data were preferentially collected in the language of the country by native data collectors (i.e. Austria, Belgium, Bosnia and Herzegovina, Brazil, Canada, Croatia, Czech Republic, Ecuador, El Salvador, France, Germany, Ghana, Honduras, Hong Kong, India, Italy, Kazakhstan, Kosovo, Kuwait, Mauritius, Mexico, Montenegro, North Macedonia, New Zealand, Poland, Portugal, Ireland, Romania, Senegal, Serbia, Spain, Syria, Taiwan, and United Kingdom). If this was not possible, Google Translate was used to translate documents 17 . All records were hand-coded.

Data Records
A static copy of the dataset has been archived in figshare 18, including all NPIs recorded as of the time of submission (17 July 2020), spanning the period 31 December 2019 to 15 July 2020. A dynamic version of the dataset, which is planned to be continually updated, can be accessed via GitHub: https://github.com/amel-github/covid19-interventionmeasures or from Google Drive: https://drive.google.com/open?id=1041U8iWPDS-GI6KHIn9Dg7THkXIo3-gui, in CSV format. Each row represents a single NPI and is identified by a unique ID. The Master List of Codes is also available (an additional Master List file displays the hierarchical relationship between each pair of parent/child codes, i.e. L1-L2, L2-L3, and L3-L4, and the number of times each pair occurs in the dataset). We also provide a Glossary of Codes, which gives the definition of each theme, category, subcategory, and code. An online interactive tool, which enables users to visualise the hierarchical structure and codes of the dataset, completes the description of the dataset. It is accessible at: http://covid19-interventions.com/CCCSLgraph/. We have also established a GitHub repository, available at: https://github.com/amel-github/CCCSL-Codes, and provide code 19 for importing, exploring and visualising the data in R 20. Furthermore, for transparency of data collection and to motivate a collaborative validation process as well as broad use and development of the dataset, an open library containing all sources used to collect the data is available: https://www.zotero.org/groups/2488884/cccsl_covid_measure_project (>3,100 data sources are included as of the date of submission).
In order to leverage the potential of crowdsourcing for populating and curating the CCCSL dataset, we have launched a webpage dedicated to this project at: http://covid19-interventions.com/, where contributors can fill in a Google Form at: https://bit.ly/2KsYOTn if they wish to correct entries, add a measure, and/or provide feedback.
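The parent/child pair counts described for the additional Master List file can be illustrated with a minimal Python sketch. This is not the authors' R code: the column names L1_Measure and L2_Measure are assumed by analogy with the documented L3_Measure and L4_Measure fields, and the category, subcategory and code labels in the toy rows are invented for illustration.

```python
from collections import Counter

# Column names for L1 and L2 are assumptions modelled on L3_Measure/L4_Measure.
LEVELS = ["L1_Measure", "L2_Measure", "L3_Measure", "L4_Measure"]

# Toy rows; the L2-L4 labels are invented and not taken from the dataset.
rows = [
    {"L1_Measure": "Social distancing", "L2_Measure": "Closure of schools",
     "L3_Measure": "Primary schools", "L4_Measure": ""},
    {"L1_Measure": "Social distancing", "L2_Measure": "Closure of schools",
     "L3_Measure": "Universities", "L4_Measure": "Remote teaching"},
    {"L1_Measure": "Travel restriction", "L2_Measure": "Border restriction",
     "L3_Measure": "", "L4_Measure": ""},
]

def pair_counts(records):
    """Count each (parent, child) pair for the adjacent levels L1-L2, L2-L3, L3-L4."""
    counts = Counter()
    for record in records:
        for parent_col, child_col in zip(LEVELS, LEVELS[1:]):
            parent, child = record[parent_col], record[child_col]
            if parent and child:  # skip levels that were left empty
                pair = f"{parent_col[:2]}-{child_col[:2]}"
                counts[(pair, parent, child)] += 1
    return counts

counts = pair_counts(rows)
for (pair, parent, child), n in sorted(counts.items()):
    print(pair, "|", parent, "->", child, "|", n)
```

Keeping the level pair in the key means that a label reused at two different depths is counted separately for each depth.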
The dataset contains the following fields:
ID - Unique identifier for each individually implemented measure. ID is also used in the Google Form to report erroneous entries.
L3_Measure - Subcategory (L3 of the classification scheme). Provides detailed information on the corresponding category (L2).
L4_Measure - Code (L4 of the classification scheme). Corresponds to the finest level of description of the measure.
Status - Indicates whether the measure is a prolongation of a previously implemented measure ("Extended") or not ("").
Comment - Provides the description of the measure as found in the text data source, translated into English. This field makes it possible to judge the quality of the label at the different levels of the coding scheme and to re-assign the measure to the correct theme/category/subcategory/code in case of error or misinterpretation by the data collector 21. When available, the duration of the restriction, as officially announced, is mentioned in this field.
Source - Provides the reference for each entry, i.e. URL. Makes it possible to trace back potential changes in the meaning of the label during translation 21, and to access the description of the measure in the source language and/or the information as it was originally released.
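As an illustration of how the Status field might be used, the following Python sketch separates newly introduced measures from prolongations. It relies only on the documented ID and Status fields; the toy values are invented, and the sketch is not part of the authors' published R code.

```python
# Toy rows using the documented ID and Status fields; the values are invented.
rows = [
    {"ID": "1", "Status": ""},
    {"ID": "2", "Status": "Extended"},
    {"ID": "3", "Status": ""},
]

def split_by_status(records):
    """Separate newly introduced measures from prolongations of earlier ones."""
    new = [r for r in records if r["Status"] != "Extended"]
    extended = [r for r in records if r["Status"] == "Extended"]
    return new, extended

new_measures, extended_measures = split_by_status(rows)
print(len(new_measures), "new,", len(extended_measures), "extended")
```

Excluding "Extended" rows can be useful when counting distinct interventions, since a prolongation repeats a measure that is already recorded.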
As of date of submission, the CCCSL dataset included information for 6,068 government interventions, from 56 countries, including 33 European countries, 12 Asian countries, five South American countries, two North American countries, one Oceanian country, three African countries, and the Diamond Princess cruise ship. Regarding the USA, data are available at the state level for 24 states. Figure 1, Table 1, and Online-only Table 2 summarize the dataset. A description of the measures grouped by theme (L1) for each country can be computed from the published codes 19 (https://github.com/amel-github/CCCSL-Codes).
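A description of the measures grouped by theme for each country could be computed along the following lines. This is a minimal Python sketch, not the published R code; the column names Country, Date and L1_Measure are assumptions, and the sample rows are invented (only the theme names come from the coding scheme).

```python
import csv
import io
from collections import Counter

# Inline sample standing in for the CSV file; column names are assumed.
SAMPLE_CSV = """ID,Country,Date,L1_Measure
1,Austria,2020-03-10,Social distancing
2,Austria,2020-03-11,Travel restriction
3,Austria,2020-03-16,Social distancing
4,Italy,2020-03-09,Social distancing
"""

def measures_by_theme(csv_text):
    """Count the recorded measures for each (country, theme) combination."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row["Country"], row["L1_Measure"]) for row in reader)

theme_counts = measures_by_theme(SAMPLE_CSV)
for (country, theme), n in sorted(theme_counts.items()):
    print(f"{country:10s} {theme:20s} {n}")
```

To run this on the real file, the `io.StringIO` wrapper would be replaced by an open file handle, after checking the actual column names in the CSV header.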

Technical Validation
After the initial data entry, the dataset was checked manually by the data curators. For each measure, concordance between L1, L2, L3, and L4 was checked. Moreover, the unique combinations of L1, L2, L3, and L4 were extracted and controlled for consistency. Typographical and coding errors were minimized through a manual process. We initiated a collaborative curation platform relying on internal and external collaborators who exchanged through Slack, GitHub, Skype, and via emails. This extended effort enabled us to correct typographical and coding errors, to remove line breaks, and to homogenize the dataset for universal use in different programming languages.
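One way to support such a concordance check programmatically is to list every child label that appears under more than one parent. The Python sketch below assumes that a category should belong to a single theme, which the text does not state explicitly; the column names and the toy rows are likewise assumptions made for illustration.

```python
from collections import defaultdict

# Toy rows; the second row deliberately puts a category under a second theme.
rows = [
    {"L1_Measure": "Travel restriction", "L2_Measure": "Border restriction"},
    {"L1_Measure": "Social distancing", "L2_Measure": "Border restriction"},
    {"L1_Measure": "Social distancing", "L2_Measure": "Closure of schools"},
]

def ambiguous_children(records, parent_col, child_col):
    """Return the child labels recorded under more than one parent label."""
    parents = defaultdict(set)
    for record in records:
        if record[child_col]:
            parents[record[child_col]].add(record[parent_col])
    return {child: sorted(ps) for child, ps in parents.items() if len(ps) > 1}

suspicious = ambiguous_children(rows, "L1_Measure", "L2_Measure")
print(suspicious)
```

The output is a list of candidates for manual review rather than a list of errors, since some labels may legitimately recur under different parents.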
Beyond manual validations, we performed a technical validation step to detect possible duplicates. Using the dplyr package for the R Programming Language 22 , we identified any duplicate entries in the vector composed of country, region, date, and the codes from L1 to L4. Those entries were flagged as possible duplicates and reviewed by hand by two curators, ensuring that the dataset does not contain duplicated entries. An R script to reproduce this step is provided at: https://github.com/amel-github/CCCSL-Codes.
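The duplicate-flagging step can be sketched in Python as follows. The authors' own script uses dplyr in R and is available in the repository cited above; this version only illustrates the idea, and the column names other than ID, L3_Measure and L4_Measure, as well as the toy rows, are assumptions.

```python
from collections import defaultdict

# Fields that jointly define a possible duplicate, following the text.
KEY_FIELDS = ("Country", "Region", "Date",
              "L1_Measure", "L2_Measure", "L3_Measure", "L4_Measure")
FIELDS = ("ID",) + KEY_FIELDS

# Toy rows (invented): rows 1 and 3 share the same key, row 2 differs by region.
raw = [
    ("1", "Austria", "", "2020-03-16", "Social distancing", "Closure of schools", "", ""),
    ("2", "Austria", "Tyrol", "2020-03-16", "Social distancing", "Closure of schools", "", ""),
    ("3", "Austria", "", "2020-03-16", "Social distancing", "Closure of schools", "", ""),
]
rows = [dict(zip(FIELDS, values)) for values in raw]

def possible_duplicates(records):
    """Group record IDs by key and keep the groups with more than one record."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in KEY_FIELDS)
        groups[key].append(record["ID"])
    return [ids for ids in groups.values() if len(ids) > 1]

flagged = possible_duplicates(rows)
print(flagged)
```

Flagged groups are only candidates: two distinct measures can share the same codes and date, which is why the flagged entries were reviewed by hand.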
While an important effort has been made to standardize the records, the four-level a priori coding scheme originally proposed showed limitations. First, the existing classifications of NPIs are discordant 23-25. We proposed an original classification scheme that best fitted our (emergency) needs and the specificity of the COVID-19 pandemic, but this scheme may be subject to revision in the future. Secondly, some NPIs have been uniquely implemented (e.g. the deportation of Chinese workers by the Kazakh government), which complicated the coding and categorisation process.
Access to the information from government or other official sources may be compromised if it is not retrieved in a timely manner. Indeed, several governments or national health agencies regularly update their webpages to provide the latest information to the public. Therefore, if sources are not consulted in time, previous content (i.e. previous restrictions and measures) might no longer be directly visible, and data will have to be retrieved indirectly or from archived websites, which slows down the data collection process and may lead to missing data. Furthermore, while native speakers were recruited whenever possible for data collection, transliteration or translation errors may have occurred when extracting data from Google Translate translations.
Lastly, when using the data for epidemiologic or economic modelling, the absence of an "End date" data element might be a limitation. However, this information cannot be captured for every kind of NPI, e.g. "Increase of healthcare workforce" or "Work safety protocol". We propose an alternative approach that leverages the theme "Returning to normal life" and records individually all (i) variations in, (ii) conditions of, and (iii) adaptive measures to the gradual lifting of the restrictions (e.g. re-opening of shops > 400 m², re-opening of classes with examination, weddings allowed if the number of attendees is < 100, etc.). By providing data on each step of the phase-out process, the coding scheme therefore makes it possible to retrieve the "End date" for each NPI even more specifically (but indirectly). To the best of our knowledge, only the CoronaNet dataset provides an "End date" data element, although, as of the date of writing, for 30% of the interventions only 26.
We plan to maintain the quality level of the dataset with regular updates on the countries currently described. Furthermore, we plan to increase the geographic coverage of the dataset, prioritizing large countries (e.g. China, US states not already covered, and Australia), those with a high number of reported cases (e.g. Vietnam, Iran, Turkey, Russia, Israel, Peru, Chile, Pakistan, Philippines, Saudi Arabia), and those where the epidemic is rising and which may suffer from a data gap (i.e. African and South American countries). The same technical procedures and the classification scheme described above will be applied to any new information to be included in the dataset. Future versions will be subjected to extensive data validation processes. We plan to stabilise the hierarchical coding scheme for NPIs implemented to contain COVID-19 within six months, including measures related to the lifting of the restrictions and adaptive measures that accompany them.

Usage Notes
The aim of this work is not only to improve the current knowledge on country-based interventions implemented to mitigate the burden of COVID-19, but also to characterise the political, public health, and economic strategies of governments worldwide. Combined with publicly available data on the number of confirmed cases, recovered cases, and deaths, the CCCSL dataset makes it possible to assess the effectiveness of the control policies on the COVID-19 epidemic, e.g. on the epidemic growth rate or the daily reproduction numbers 12. The standardized coding facilitates an inter-country comparison of government responses. The dataset can further benefit the risk assessment of lifting some restrictions and the development of exit strategies. It can also become an essential data source in the aftermath of the first wave of COVID-19, to guide government control policies anticipating a potential second wave of cases. We envision the CCCSL dataset becoming a timely, valuable and long-lasting data source for assessing the impact of the NPIs on global public health indicators, the economy, and human rights, among others. We provide below two examples of data usage that give an insight into the responsiveness and aggressiveness of governments in their management of the COVID-19 crisis.
Mapping the timeline of government interventions during the epidemic. We propose to visualise the time series of the dates of implementation of the NPIs recorded in the CCCSL at level 2 of the hierarchical coding scheme (categories) in the 56 countries using a heat map (Fig. 2). In order to highlight country-based differences in the timeline of implementation, we used the epidemic age instead of calendar time. For a given day, t, in a given country, the epidemic age is defined as the time difference, t - t_0, measured in days, where t_0 is the first day on which the number of confirmed cases was greater than or equal to 10. The time-series data of the number of COVID-19 cases was retrieved from the COVID-19 Data Repository by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) at: https://github.com/CSSEGISandData/COVID-19.
Country-cluster analysis of the government control strategies. In order to partition the countries based on the aggressiveness (number of NPIs) and responsiveness (timeline) of their control strategy, we applied k-means clustering. We focused on mandatory government interventions (i.e. the theme "Risk communication" was not included) recorded in the CCCSL at level 2 of the hierarchical coding scheme (categories) that appeared in at least 15 countries, leading to a total of 40 categories. The clustering algorithm uses the date of implementation of the measures in each country to build a feature vector based on the epidemic age (see above). We considered "anticipatory measures" as those implemented before the day when 10 cases were reported; "early measures" as those implemented at the beginning of the epidemic, i.e. between the day when 10 cases were reported and the day when 200 cases were reported; and "late measures" as those implemented at a later stage of the epidemic, i.e. after the day when 200 cases were reported.
The algorithm also takes into account the number of measures implemented at these different stages of the epidemic. The time-series data of the number of COVID-19 cases was retrieved from the COVID-19 Data Repository by the JHU CSSE (https://github.com/CSSEGISandData/COVID-19). The optimal number of clusters, k, was determined using the elbow method 27. Briefly, this method consists in running k-means clustering on the dataset for a range of values of k (set here from 1 to 15) and calculating the sum of squared errors (SSE) for each value of k. We then plotted the SSE for each value of k and identified the best value of k as the point where the line chart bends like an arm (the "elbow"). As of the date of publication (static version 2020-07-12, 56 countries), the best value of k was eight, explaining 82.8% of the variance (Fig. 3). An interactive version of Fig. 3 is available online at: http://covid19-interventions.com/CountryClusters.html.
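The epidemic age and the three implementation stages defined above can be illustrated with a short Python sketch. The cumulative case counts are invented, the function names are hypothetical, and the handling of the boundary days (a measure taken exactly on the day when 10 or 200 cases were reported counts as "early") is an assumption, since the text does not specify it.

```python
from datetime import date

# Invented cumulative confirmed case counts for one hypothetical country.
cases = {
    date(2020, 3, 1): 2,
    date(2020, 3, 5): 12,
    date(2020, 3, 12): 150,
    date(2020, 3, 15): 230,
}

def first_day_reaching(cases_by_day, threshold):
    """Return the first date with at least `threshold` confirmed cases, or None."""
    for day in sorted(cases_by_day):
        if cases_by_day[day] >= threshold:
            return day
    return None

def epidemic_age(day, t0):
    """Epidemic age t - t_0 in days; negative before the 10-case day."""
    return (day - t0).days

def stage(day, cases_by_day):
    """Classify an implementation date as anticipatory, early or late."""
    t10 = first_day_reaching(cases_by_day, 10)
    t200 = first_day_reaching(cases_by_day, 200)
    if t10 is None or day < t10:
        return "anticipatory"
    if t200 is None or day <= t200:  # boundary days counted as early (assumption)
        return "early"
    return "late"

t0 = first_day_reaching(cases, 10)
print(epidemic_age(date(2020, 3, 10), t0), stage(date(2020, 3, 10), cases))
```

A negative epidemic age corresponds to an anticipatory measure, which makes the two quantities easy to cross-check.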
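The elbow method can be sketched with a small, self-contained k-means implementation in Python. This is an illustration under stated assumptions rather than the authors' procedure: the two-dimensional toy points stand in for the country feature vectors, the random restarts and iteration limit are arbitrary choices, and the explained share of variance is computed here as 1 - SSE(k)/SSE(1), which may differ from how the reported figure was obtained.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def kmeans_sse(points, k, n_iter=50, seed=0):
    """Run Lloyd's k-means once and return the sum of squared errors (SSE)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise on k distinct data points
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the previous centroid if a cluster ends up empty.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def elbow_curve(points, k_values, n_restarts=10):
    """Best SSE over several random restarts for each candidate k."""
    return {k: min(kmeans_sse(points, k, seed=s) for s in range(n_restarts))
            for k in k_values}

# Toy feature vectors (invented): two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
sse = elbow_curve(points, range(1, len(points) + 1))
for k, value in sse.items():
    print(k, round(value, 3), "explained:", round(1 - value / sse[1], 3))
```

The elbow is then read off the printed or plotted SSE curve as the value of k beyond which adding clusters yields only a small further reduction.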
Contextualising the project. During the COVID-19 crisis, other projects have concomitantly tracked data on government policies (interchangeably named NPIs 18, government(s') responses 26,28,29, government measures 30, PHSM 31, or policy actions 26). We report here on five of them 26,28-31 in order to contextualise our work. The comparison indicates similarities and differences among the NPI trackers and highlights how the CCCSL contributes to the global effort against COVID-19. Supplementary Information 1 outlines the main characteristics of the six datasets (including the CCCSL 18).
The core added value of the CCCSL dataset is the remarkable granularity of the data on NPIs (e.g. seven categories of travel restriction are reported, further divided into more than 50 subcategories) and the use of self-explanatory codes, which, complemented by the Glossary of Codes, make the dataset readily intelligible. As of the date of submission, the dataset displays eight themes, 63 categories, >500 subcategories, and >2,000 codes.
With regard to the geographic unit, two datasets record data at the country level 28,29 whereas four record data at a finer administrative scale 18,26,30,31. One dataset uses a binary code (1/0) to assess the presence/absence of the NPIs 28, another one uses a Likert-like scale to further differentiate the level of implementation 29, whereas the others use a coding system based on words or short phrases that assign a summative attribute to the data 18,26,30,31. Moreover, the aggregation scheme and, sometimes, the semantics of the NPIs diverge widely between the datasets. For example, the CoronaNet dataset 26 groups school closure together with lockdown measures, whereas the CCCSL 18 and the ACAPS 30 datasets classify school closure in the theme "Social distancing". The restriction on individual movement is labelled "Partial lockdown" in the ACAPS dataset 30, "Household confinement" in the HIT-COVID dataset 31, "Lockdown applies to all people" in the CoronaNet dataset 26, and "Movements for non-essential activities forbidden" in the CCCSL dataset 18. Overall, these projects are independent of each other, and the specific research question should indicate which one(s) to use. Harmonizing and integrating the different datasets could help accelerate epidemiological understanding of COVID-19 and the development of relevant pandemic preparedness plans. The World Health Organization is currently making an important effort in this regard 1.
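A first step towards harmonizing the trackers could be a simple crosswalk that maps each dataset-specific label to a shared concept. The Python sketch below uses only the four labels quoted above for the restriction on individual movement; the shared key name and the helper function are hypothetical and do not come from any of the datasets.

```python
# Shared concept name chosen for illustration only (hypothetical).
MOVEMENT = "restriction_on_individual_movement"

# Labels quoted in the text, keyed by (dataset, label).
CROSSWALK = {
    ("ACAPS", "Partial lockdown"): MOVEMENT,
    ("HIT-COVID", "Household confinement"): MOVEMENT,
    ("CoronaNet", "Lockdown applies to all people"): MOVEMENT,
    ("CCCSL", "Movements for non-essential activities forbidden"): MOVEMENT,
}

def harmonize(dataset, label):
    """Return the shared concept for a dataset-specific label, or None if unmapped."""
    return CROSSWALK.get((dataset, label))

print(harmonize("HIT-COVID", "Household confinement"))
```

Returning None for unmapped labels keeps gaps in the crosswalk visible instead of silently assigning a concept.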

Code availability
A live version of this project is accessible on GitHub at: https://github.com/amel-github/covid19-interventionmeasures. The code used to describe and explore the CCCSL dataset is written in R 19. It is available at: https://github.com/amel-github/CCCSL-Codes. Please refer to the README file in the code release for further instructions.