WNTRAC: AI Assisted Tracking of Non-pharmaceutical Interventions Implemented Worldwide for COVID-19

The Coronavirus disease 2019 (COVID-19) global pandemic has transformed almost every facet of human society throughout the world. Against an emerging, highly transmissible disease with no definitive treatment or vaccine, governments worldwide have implemented non-pharmaceutical interventions (NPIs) to slow the spread of the virus. Examples of such interventions include community actions (e.g. school closures, restrictions on mass gatherings), individual actions (e.g. mask wearing, self-quarantine), and environmental actions (e.g. public facility cleaning). We present the Worldwide Non-pharmaceutical Interventions Tracker for COVID-19 (WNTRAC), a comprehensive dataset consisting of over 6,000 NPIs implemented worldwide since the start of the pandemic. WNTRAC covers NPIs implemented across 261 countries and territories, and classifies NPI measures into a taxonomy of sixteen NPI types. NPI measures are automatically extracted daily from Wikipedia articles using natural language processing techniques and manually validated to ensure accuracy and veracity. We hope that the dataset will be valuable to policymakers, public health leaders, and researchers in modeling and analysis efforts for controlling the spread of COVID-19.


Background & Summary
The Coronavirus disease 2019 (COVID-19) pandemic has had an unprecedented impact on almost every facet of human civilization, from healthcare systems to economies and governments worldwide. As of August 2020, every country in the world has been affected, with more than 24M confirmed cases of infection and a death toll approaching one million worldwide [1][2][3]. The pandemic has triggered a wide range of non-pharmaceutical intervention (NPI) responses across the world. With therapeutic and preventive interventions still in early stages of development, every country has resorted to NPIs as a primary strategy [4,5] for disease control. Examples of such interventions include community actions (e.g. school closures, restrictions on mass gatherings), individual actions (e.g. mask wearing, self-quarantine), and environmental actions (e.g. public facility cleaning). Such NPIs vary significantly in their implementation based on the maturity of the health infrastructure, the robustness of the economy, and the cultural values unique to the region.
Public health policymakers worldwide are striving to introduce successful intervention plans to manage the spread of the disease while balancing the socio-economic impacts [6,7]. These initiatives will benefit from modeling the efficacy of different intervention strategies. The pandemic has sparked an ongoing surge of discovery and information sharing, resulting in an unprecedented amount of data being published online [8]. This includes information about NPI measures, which is available in a wide variety of unstructured data sources, including official government websites [9,10], press releases, social media, and news articles. However, modeling the efficacy of interventions requires the information about the NPIs to be available in a structured form.
arXiv:2009.07057v4 [cs.CY] 4 Jan 2021

To address this urgent need, several data collection initiatives have emerged in recent months, resulting in several publicly available datasets with varying degrees of coverage, data freshness, and sparsity. For example, the CoronaNet dataset [11] contains monadic and dyadic data on policy actions taken by governments across the world; it is manually curated by over 500 researchers, covers sixteen NPI types, and is kept fairly up-to-date. The Complexity Science Hub, Vienna enlisted researchers, students, and volunteers to curate the Complexity Science Hub COVID-19 Control Strategies List [12] dataset, which has eight different NPI types but covers only 57 countries. Similarly, the Oxford COVID-19 Government Response Tracker [13] dataset takes a crowd-sourcing approach and covers 17 NPI types, 186 regions, and 52 US states and territories. Because all these datasets are assembled manually, each of them is constrained in one or more respects: geographical scope, taxonomic richness, frequency of updates or granularity of details, and evidential sources. An AI-assisted, semi-automated data collection approach, driven by a rich, extensible taxonomy, can help overcome these issues and may result in a larger, frequently updated dataset with less manual labor.
Wikipedia is one of the main sources of accessible information on the Internet. Since the start of COVID-19, a dedicated global network of volunteers has been creating, updating, and translating Wikipedia articles with vital information about the pandemic [14]. Over 5,000 new Wikipedia pages on COVID-19 have been written by more than 71,000 volunteers since the onset of the pandemic, accumulating more than 440M page views by June 2020. Wikipedia articles, even though crowd-sourced, can serve as a reliable source of NPI data through the process of collective validation [15] and through citations of credible sources such as government websites, scientific literature, and news articles. Further, these Wikipedia articles are constantly updated: they had been edited more than 793,000 times as of August 2020, making Wikipedia both a rich and an up-to-date source.
Based on this, we postulated that an approach based on automated information extraction from Wikipedia, followed by human validation to ensure accuracy and veracity, would result in a frequently updated dataset with wider coverage than any of the existing datasets. We present the result of our work, WNTRAC, a comprehensive dataset consisting of over 6,000 NPIs implemented worldwide since the start of the pandemic. WNTRAC covers NPIs implemented across 261 countries and territories, and classifies NPI measures into a taxonomy of sixteen NPI categories. NPI measures are automatically extracted daily from Wikipedia articles using natural language processing (NLP) techniques and manually validated to ensure accuracy and veracity.
In what follows, we explain the methods used to create the dataset, outline the challenges and key design choices, describe the format, provide an assessment of its quality and lay out our vision of how this dataset can be used by policy makers, public health leaders, and data scientists and researchers to support modeling and analysis efforts.

Methods
We built a semi-automated system to construct the dataset and keep it current. For information extraction purposes, the NPI measures are modeled as events and evidences, as illustrated by the motivating example shown in Figure 2. Each event corresponds to the imposition or lifting of a particular NPI. An event is defined as a 5-tuple (what, value, where, when, restriction), where
1. What: the type of NPI that was imposed or lifted. NPIs are grouped into sixteen major types. In the example, the type is school closure.
2. Value: a sub-category or attribute that qualifies the NPI type more specifically. In the example, the associated value is all schools closed. A detailed description of each type and the corresponding possible values is shown in Table 1.
3. Where: the region (country, territory, province, or state) in which the NPI measure has been implemented or withdrawn. In the example, three distinct regions (Punjab, Chhattisgarh, and Manipur) are identified, and three separate events are extracted.
4. When: the date from which the NPI was imposed or lifted. In the example, the date is 13 March, corresponding to the implementation of the NPI, even though a likely date for the cancellation of the NPI, 31 March, is indicated.
5. Restriction: a flag indicating whether the event corresponds to the introduction or the withdrawal of the NPI. Note that the lifting of an NPI is treated as a separate event. In the example, the restriction type is imposed.
In addition to the mandatory fields described above, each event contains one or more evidences. An evidence is a span of text extracted from Wikipedia that discusses a particular event. In the example, the evidence is "On 13 March, the Punjab, Chhattisgarh, and Manipur governments declared holidays in all schools and colleges till 31 March." An evidence may support more than one event. Each evidence is accompanied by a source type indicating the type of source of the Wikipedia citation. More details about such additional attributes can be found in the Data Records section.
The system, shown in Figure 3, is designed to be scalable for continuous gathering, extraction, and validation of NPI events. It consists of three subsystems: a data processing pipeline for capturing and extracting potential NPI events from Wikipedia articles, a tool called WNTRAC Curator for human validation of the NPI events automatically extracted by the pipeline, and a data browser for visualizing the data. In the next section, we describe the system and its components at a high level, focusing on key design choices that have a bearing on the quality of the dataset, starting with a brief description of the data collection.
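The event and evidence model described above can be sketched as a pair of small record types. This is only an illustration: the class names, the helper `events_from_sentence`, the source-type label "news", and the year 2020 attached to "13 March" are assumptions made for the example and are not taken from the WNTRAC code base.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class Evidence:
    """A span of Wikipedia text that discusses an event."""
    text: str
    source_type: str  # hypothetical label for the kind of cited source


@dataclass
class Event:
    """One imposition or lifting of an NPI, i.e. the 5-tuple plus its evidences."""
    what: str          # NPI type, e.g. "school closure"
    value: str         # qualifier of the type, e.g. "all schools closed"
    where: str         # region in which the NPI applies
    when: date         # date from which the NPI was imposed or lifted
    restriction: str   # "imposed" or "lifted"
    evidences: List[Evidence] = field(default_factory=list)

    def key(self) -> Tuple[str, str, str, date, str]:
        """The 5-tuple that uniquely identifies the event."""
        return (self.what, self.value, self.where, self.when, self.restriction)


def events_from_sentence(evidence: Evidence, what: str, value: str,
                         regions: List[str], when: date,
                         restriction: str) -> List[Event]:
    """Create one event per region; all of them share the same evidence."""
    return [Event(what, value, region, when, restriction, [evidence])
            for region in regions]


sentence = ("On 13 March, the Punjab, Chhattisgarh, and Manipur governments "
            "declared holidays in all schools and colleges till 31 March.")
evidence = Evidence(sentence, source_type="news")  # source type is assumed
events = events_from_sentence(
    evidence,
    what="school closure",
    value="all schools closed",
    regions=["Punjab", "Chhattisgarh", "Manipur"],
    when=date(2020, 3, 13),  # the year is an assumption; the text says "13 March"
    restriction="imposed",
)
```

The single sentence yields three events, one per region, each pointing at the same evidence, which mirrors the statement that an evidence may support more than one event.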

Processing Pipeline
The first step in the data processing is to retrieve the aforementioned list of Wikipedia articles on a periodic basis. The crawler module implements this functionality. It uses the MediaWiki API [17] to download the articles. As part of this step, we extract the text content of each article while preserving all the associated citations. This process produces a document for each article. Each sentence in a document is a candidate for NPI extraction. As of August 2020, the aggregate crawled data contains over 55,000 sentences, with an average of 213 sentences per document.
The second step in the pipeline is the extraction of the NPI events from a document. It is broken into the sequence of steps described below.
• Pre-processing: As the first step in processing a document, we use sentence boundary detection algorithms from libraries such as spaCy [18] to identify where sentences begin and end. Although sentences are used as logical units to extract NPI events, we preserve the order in which they appear in the source document for reasons detailed below. Also at this step, we extract and retain the citation URL for each sentence, if available.
• Sentence classification: Next, we classify each sentence into one of the NPI types, such as school closure, to identify potential NPI events. If no NPI is discussed in the sentence, we classify it as discarded. We use multiple learning algorithms, including logistic regression, Support Vector Machines, and Bidirectional Encoder Representations from Transformers (BERT) [19], and employ an ensemble method to obtain better overall predictive performance. A small subset of the data (1,490 sentences) was manually annotated to train the models. Independently, we also categorize the sentence as implying either the introduction or the withdrawal of an NPI (restriction).
• Named entity recognition and named entity disambiguation: After we identify the potential events in the previous step, we extract the specific constituent entities for each candidate event from the sentence. We use state-of-the-art named-entity recognizers (such as spaCy [18]) and normalizers to detect and normalize locations (Where: [Punjab, Chhattisgarh, Manipur]) and time expressions (When: March 13). In addition, we link the location entities of type 'GPE' in the Wikipedia article title to the corresponding ISO codes [20,21]. Even though we use the sentence as the logical unit for the extraction of an NPI event, the sentence itself may not include all the relevant information. For example, the date or location may be available in nearby sentences or in the header of the paragraph to which the sentence belongs. To address this key challenge, we developed a heuristic-based relation detection algorithm that associates one of the dates or locations extracted from the current document with each sentence.
• Value extraction: The last step in NPI event extraction is determining the associated value. We use multiple rule-based algorithms that either operate independently or depend on information extracted by the previous steps. For example, given the sentence "On 13 March, it was announced at an official press conference that a four-week ban on public gatherings of more than 100.", the event type is mass gatherings and the associated value is the maximum number of people allowed by the government in a social gathering. The value extraction is performed using parse-based rule engines [18]. It is worth noting that the value extraction component must know the actual type, mass gatherings, before it can extract the correct value "100". Similarly, given the sentence "On 1 April, the Government of USA suspended flights from New York to Texas", the event type is domestic flight restriction and the associated value is the name of the state from which the passenger is arriving. To extract this value correctly, the value extraction component needs to know both the correct type and the normalized locations (here, "New York").
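Two of the steps above, ensemble classification and type-dependent value extraction, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the ensemble is shown as plain averaging of per-model scores, the discard threshold of 0.2 and all score values are invented, and a regular expression stands in for the parse-based rule engines mentioned in the text. The function names are hypothetical.

```python
import re
from typing import Dict, List, Optional

DISCARDED = "discarded"


def ensemble_classify(model_scores: List[Dict[str, float]],
                      threshold: float = 0.2) -> str:
    """Average per-type scores across models (soft voting, assumed here).

    A sentence is discarded only when every NPI type scores below the
    threshold, which keeps the classifier recall-oriented.
    """
    if not model_scores:
        return DISCARDED
    types = sorted(set().union(*model_scores))
    if not types:
        return DISCARDED
    average = {t: sum(scores.get(t, 0.0) for scores in model_scores) / len(model_scores)
               for t in types}
    best = max(types, key=lambda t: average[t])
    return best if average[best] >= threshold else DISCARDED


# Stand-in rule for one NPI type; the real system uses parse-based rules.
_GATHERING_LIMIT = re.compile(r"(?:more than|over|exceeding)\s+(\d[\d,]*)", re.IGNORECASE)


def extract_value(npi_type: str, sentence: str) -> Optional[str]:
    """Extract the value for a sentence, given its already-predicted type."""
    if npi_type == "mass gatherings":
        match = _GATHERING_LIMIT.search(sentence)
        return match.group(1).replace(",", "") if match else None
    return None  # rules for the other types are omitted in this sketch


sentence = ("On 13 March, it was announced at an official press conference "
            "that a four-week ban on public gatherings of more than 100.")
# Invented scores standing in for three models' outputs.
scores = [
    {"mass gatherings": 0.7, "school closure": 0.1},
    {"mass gatherings": 0.6},
    {"mass gatherings": 0.8, "confinement": 0.1},
]
predicted_type = ensemble_classify(scores)
value = extract_value(predicted_type, sentence)
```

The value rule only fires once the type is known, which reflects the dependency between the classification and value extraction steps described above.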
Thus, using the above procedure, we extract the unique 5-tuples that are the candidate NPI events. Once extracted, they are presented to the volunteers for validation to ensure data quality. This process is repeated every day. In order to minimize manual labor, considering the small number of volunteers, we attempt to detect changes since the last time we crawled Wikipedia. We use a combination of syntactic similarity metrics such as Levenshtein Norm, and semantic similarity metrics such as event attribute matching to perform this daily change detection for each extracted document.
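The syntactic half of this change detection can be sketched as follows. The sketch interprets the Levenshtein-based metric as an edit distance normalized by the length of the longer sentence, and the 0.9 similarity threshold is an arbitrary choice for the example; neither detail is specified in the text. The semantic comparison based on event attribute matching is not shown.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """1.0 for identical strings, approaching 0.0 for unrelated ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def changed_sentences(old: List[str], new: List[str],
                      threshold: float = 0.9) -> List[str]:
    """Return sentences in the new crawl with no close match in the old one."""
    return [sentence for sentence in new
            if all(normalized_similarity(sentence, seen) < threshold for seen in old)]


previous_crawl = ["Schools closed on 13 March."]
current_crawl = ["Schools closed on 13 March.",
                 "Flights were suspended on 1 April."]
to_review = changed_sentences(previous_crawl, current_crawl)
```

Only the sentences returned by `changed_sentences` would need to go back through extraction and validation, which is the labor-saving effect described above.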

WNTRAC Curator
The events automatically extracted by the pipeline are vetted by volunteers using the WNTRAC Curator validation tool. The tool is a simple web application backed by a database, as shown in Figure 3. The tool itself is shown in Figure 4. At the top, it displays the complete Wikipedia document extracted by the processing pipeline. Below the document, each candidate event is shown to the volunteer in a separate card. The volunteer can judge the candidate event to be a brand new NPI event or an evidence for an existing event, or can discard the candidate. They can also correct any of the attributes associated with the event extracted by the pipeline.

Data Browser
Figure 5 presents an interactive data browser [22] that uses a chart, a map, and a histogram to provide descriptive analyses of NPIs and COVID-19 outcomes such as confirmed cases and deaths. The browser has a control panel used to filter the data being visualized (e.g. cases vs deaths), as well as how it is visualized (e.g. linear vs log scale). A play slider can be used to view the temporal evolution of NPIs and COVID-19 outcomes in a given region. The chart illustrates the time points at which a geographical region imposes or lifts an NPI, along with the temporal trends of COVID-19 outcomes. The different types of NPIs are illustrated using specific icons that are described in a legend. Groups of interventions are noted with the star icon. The number of countries/territories and the number of NPIs shown in the chart can be adjusted in the settings. The user can select a specific line on the chart referring to a territory to focus on the NPIs imposed and lifted in that location. The histogram below the chart shows the number of territories that have imposed the different types of NPIs, and its bars can be selected to see on the map the territories that have imposed the selected subset of NPIs. The map illustrates the proportion of NPI categories (out of the 16 NPI categories in the dataset) implemented in each region using a gray-colored bar. Furthermore, when a region is selected, the gray-colored bar in any other region illustrates the proportion of NPI categories in the other region as a proportion of the NPI categories implemented in the selected region. The map is also used to visualize the geographic distribution of the selected COVID-19 outcome using choropleth, spikes, or bubbles. The user can interact with the territories on the map to focus on a location and view the data on the chart. Note that for some countries, such as the United States, the map can be zoomed to reveal finer-grained data for sub-regions such as states.

Data Records
In addition to the key fields discussed earlier, the dataset also contains a few additional attributes for each event. A complete listing of all fields across event and evidence is shown in Table 2, along with an example for each field. Each version of the dataset consists of two CSV files named ibm-wntrac-yyyy-mm-dd-events.csv and ibm-wntrac-yyyy-mm-dd-evidences.csv, corresponding to events and evidences respectively. A live version of the dataset is available for download in our GitHub repository, https://github.com/IBM/wntrac/tree/master/data. The dataset is regularly updated. At the time of submission, the dataset is updated as of October 13th, 2020. Historical versions of the dataset are made available in the same GitHub repository. Further, a static copy of the dataset containing NPIs recorded as of 8th July 2020, used for the technical validation in the paper, has been archived in figshare [23]. In the next section, we include some high-level dataset statistics to provide a sense of the distribution of the data.
Figure 6 shows the distribution of the NPI measures imposed worldwide. Entertainment / cultural sector closure, confinement, and school closure are the predominant NPIs taken by governments 2. Figure 7 summarizes the total number of regions that implemented NPIs of each type. As shown in the graph, confinement, school closure, and freedom of movement are the most common NPIs imposed worldwide, as expected from Figure 6. Figure 8 shows the breakdown of the NPIs within each region for the twenty regions that have implemented the highest number of NPI measures.
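Given the file naming convention described in this section, a dated snapshot can be located and read with the standard library alone. The helper names below are hypothetical, the date is chosen only for illustration (the text does not confirm that a snapshot with exactly this date exists in the repository), and no column names are assumed because Table 2 defines them.

```python
import csv
from datetime import date
from typing import Dict, List, Tuple


def dataset_filenames(version: date) -> Tuple[str, str]:
    """Build the events and evidences file names for one dataset version."""
    stamp = version.isoformat()  # yyyy-mm-dd, as in the naming convention
    return (f"ibm-wntrac-{stamp}-events.csv",
            f"ibm-wntrac-{stamp}-evidences.csv")


def load_records(path: str) -> List[Dict[str, str]]:
    """Read one CSV file into a list of rows keyed by its header fields."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Illustrative date only; check the repository for the versions that exist.
events_file, evidences_file = dataset_filenames(date(2020, 7, 8))
```

After downloading the two files, `load_records(events_file)` and `load_records(evidences_file)` would return the rows with whatever fields the header declares.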

Technical Validation
The validation team consisted of a mix of experts who participated in the design of the taxonomy and/or the pipeline and IBM volunteers who completed a brief training session about the annotation schema and tool. Validation was done in two phases. In the first phase, because the WNTRAC Curator tool was still being developed, we used simple CSV files to distribute the data for validation. Each annotator was given a complete document corresponding to a Wikipedia article for a particular region, retrieved as of June 6, 2020, pre-annotated with the output of the pipeline. Each sentence was displayed on a separate line, with sentences corresponding to candidate events highlighted with a different background color. The attributes extracted by the pipeline were listed next to each sentence. Annotators were asked to verify and correct each of these attributes. If a sentence did not discuss any of the valid event types, they were asked to mark the type as discarded. If a sentence was incorrectly discarded by the pipeline, they were asked to correct the type and fill in the attributes when possible. This was, however, not uniformly enforced. In the second phase, we made the WNTRAC Curator tool available to the annotators. The tool randomly assigns a single document to be validated to each annotator. Each document consists of the incremental changes to the underlying Wikipedia article since the last validation of the document. The validation process for the second phase is similar to the first phase, except that only candidate events, as determined by the pipeline, were shown to the annotators. This time-saving move was based on the observation that during the first phase, when all sentences were presented, human annotators generally agreed with the automated pipeline on discarded sentences. The NLP model used a recall-oriented threshold and only discarded sentences with low scores on all valid NPI types.
Table 3. Inter-annotator agreement between average volunteers (A) and two groups of experienced volunteers (E1 and E2). Region includes both country and state/territories as applicable.
To determine the quality of the dataset after validation, inter-annotator agreement (IAA) was calculated on a randomly sampled subset (2%) of the full set that was validated by IBM volunteers. Each instance in the subset was additionally annotated by two experts (randomly selected from a pool of six experts) working independently, resulting in three sets of annotations per instance. The IAA was evaluated on all five fields of the 5-tuple that uniquely defines an event. The evaluation was performed at the field level for all fields except the value, which is technically a sub-field of the type and is not meaningful to analyze on its own. The IAA results are shown in Table 3. Note that the IAA between experts was consistently high in all categories, indicating that the annotation schema is not ambiguous and that most sentences can be consistently assigned to one of the NPI types defined in the taxonomy. The IAA between the volunteers and experts was also good (0.58) at the NPI type level, and the agreement is high (0.81) for the five most frequent NPI types. We plan to expand the taxonomy over time to cover more NPI types. We also plan to improve the accuracy of the pipeline by using end-to-end entity linking techniques for entity normalization and state-of-the-art methods for better temporal alignment. We plan to expand to other data sources to improve coverage.
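For readers who want to run a similar field-level agreement check on their own annotations, the sketch below computes Cohen's kappa for two annotators. This is an assumption made for illustration: the text does not name the agreement statistic behind Table 3, and the labels in the example are invented.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented NPI type labels from a volunteer and an expert for four sentences.
volunteer = ["school closure", "school closure", "confinement", "confinement"]
expert = ["school closure", "confinement", "confinement", "confinement"]
kappa = cohen_kappa(volunteer, expert)
```

Applying the function separately to each field (type, region, date, restriction) gives a field-level breakdown of the kind the evaluation describes.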

Usage Notes
One of the primary objectives in creating the WNTRAC dataset was to understand what types of NPIs are being implemented worldwide and to facilitate analysis of the efficacy of the different types of NPIs. Specifically, the dataset supports a variety of studies, such as correlation analysis to understand the associations between NPIs and outcomes, causal inference between NPIs and specific outcome variables, and impact analysis to understand the effect on socio-economic factors. Furthermore, this dataset offers an opportunity to explore locally contextualized what-if scenarios and optimal intervention planning by incorporating NPIs into epidemiological models. Such capabilities are critical for targeted decision-making to control the spread of the disease and minimize its impact on society.
There are a number of questions, ranging in complexity, that the dataset can be used to answer. For example, consider the question: how many NPIs were imposed and lifted globally as the pandemic continued? Figure 9 sums the number of NPIs imposed and lifted in all geographies per month. As expected, the vast majority of NPIs were imposed during the first outbreak of COVID-19 in March, and NPIs were lifted mainly in April and May. This figure also reveals the imbalance between imposed and lifted NPIs that exists in the data. For example, while more than three thousand NPIs were imposed in March, fewer than five hundred were lifted between April and September. The imbalance can be the outcome of many factors, such as how and when the lifting of NPIs is announced over time. Such factors should be taken into account when performing analysis using this dataset.
A second example use of the dataset is to explore which NPIs were imposed by different countries early in the pandemic to contain the spread of COVID-19. One approach is to break the set of NPIs into two sets: travel-related and community-related. Travel-related NPIs include domestic flight restrictions, international flight restrictions, freedom of movement (nationality dependent), and introduction of travel quarantine policies. Figure 10a visualizes the elapsed time between the implementation of a travel-related NPI and the recording of at least 50 cases, as well as the time to the first reported death. The visualization shows 9 selected regions, each of which had at least one travel-related NPI among the first set of NPIs imposed in the region, and was generated by combining the WNTRAC dataset with the COVID-19 outcomes dataset from the World Health Organization (WHO) [2]. For each region, the blue bar illustrates the number of days before 50 cumulative cases, and the red point shows the number of days before the first death.
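Both questions above reduce to simple date arithmetic over event records. The sketch below shows a monthly tally of imposed and lifted NPIs and the number of days between a region's first NPI and the day its cumulative case count reaches a threshold. The function names are hypothetical and every date and count is an invented placeholder, not data from WNTRAC or the WHO.

```python
from collections import Counter
from datetime import date
from typing import Iterable, List, Optional, Tuple


def monthly_counts(events: Iterable[Tuple[date, str]]) -> Counter:
    """Count events per (year-month, restriction) pair."""
    return Counter((when.strftime("%Y-%m"), restriction)
                   for when, restriction in events)


def days_before_threshold(first_npi: date,
                          cumulative: List[Tuple[date, int]],
                          threshold: int = 50) -> Optional[int]:
    """Days from the first NPI to the first day the cumulative count
    reaches the threshold. Negative means the NPI came afterwards;
    None means the threshold was never reached in the series."""
    for day, total in sorted(cumulative):
        if total >= threshold:
            return (day - first_npi).days
    return None


# Placeholder events: (date the NPI took effect, restriction flag).
toy_events = [
    (date(2020, 3, 13), "imposed"),
    (date(2020, 3, 20), "imposed"),
    (date(2020, 5, 4), "lifted"),
]
counts = monthly_counts(toy_events)

# Placeholder cumulative case series for one hypothetical region.
toy_cases = [(date(2020, 3, 10), 12), (date(2020, 3, 18), 50), (date(2020, 3, 25), 130)]
lead_time = days_before_threshold(date(2020, 3, 8), toy_cases, threshold=50)
```

Calling `days_before_threshold` with a deaths series and `threshold=1` gives the corresponding time to the first reported death.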
From Figure 10a, it can be observed that Singapore first imposed a travel-related NPI more than 50 days before its first death, showing an earlier response than Brazil and New York State, where the first travel-related NPIs were imposed about 10 days after the first death. Similarly, Figure 10b visualizes the elapsed time between the implementation of community-related NPIs and the recording of at least 50 cases and at least one death for 9 selected regions. The community-related NPIs include entertainment/cultural sector closure, confinement, school closure, mass gatherings, mask wearing, public services closure, public transportation, work restrictions, and state of emergency. It can be noted that at least one community-related NPI was imposed in each of the selected regions prior to its first recorded death due to COVID-19.
As a third example, we demonstrate how the WNTRAC dataset can be used to generate an index, a summary statistic in [0, 1] that represents the NPIs imposed and, if available, the adherence to them. This index can be used to study the relationship between NPIs and COVID-19 outcomes over time and to compare response strategies across jurisdictions. Figure 11 illustrates this using data from representative states in the United States (Florida, Georgia, New York, and Texas). In the figure, the bar graph shows the trend of the exponentially weighted moving average of new cases per 100,000 population. The red continuous line is the proportion of NPI types (out of thirteen NPI types in the WNTRAC dataset) that a region has imposed at a given time. The blue continuous line is the WNTRAC NPI Index, a composite index that captures both the stringency levels of the NPIs and community mobility data as a proxy measure of adherence to NPI strategies. The WNTRAC NPI Index, denoted η(t), is presented in Eq. 1, and the code for the WNTRAC NPI Index is available in the repository.
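Assuming the index is a simple weighted sum of a stringency term and an adherence term, a form of Eq. 1 consistent with the weights and terms described here would be the following. The additive form and the time argument on SI are inferred from the description, not quoted from the authors.

```latex
\eta(t) = \omega_0 \, SI(t) + \omega_1 \, A(t),
\qquad \omega_0, \omega_1 > 0, \quad \omega_0 + \omega_1 = 1 .
```

Under this reading, η(t) stays within [0, 1] whenever both SI(t) and A(t) are themselves scaled to [0, 1], because the weights form a convex combination.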
Here, ω_0, ω_1 > 0 are weights applied to each term, with ω_0 + ω_1 = 1. The first term, SI, is derived by mapping and scoring the WNTRAC NPIs in a manner similar to the Oxford COVID-19 Government Response Tracker (OxCGRT) stringency index [13]. The second term, A(t), represents adherence at a specific point in time, using mobility data as a proxy. Specifically, we define A(t) in Eq. 2 as a function of the "anticipated mobility", m_ant, and the "observed mobility", m_obs. The anticipated mobility at a specific point in time is the mobility score that would potentially be associated with the NPIs in place at that time. The observed mobility is the mobility value observed in that region at a specific time point and ideally should be close to the value of the anticipated mobility. In our work, we assume a negative linear relationship between stringency and mobility, and the anticipated mobility is derived from this linear relationship with noise.
As illustrated, the WNTRAC NPI metrics can be compared to existing metrics such as the OxCGRT stringency index [13]. Note that a detailed interpretation of the relationships illustrated in this example is subject to addressing limitations such as missing data, and will be pursued as part of our future work.
Finally, another important application of the WNTRAC dataset is to support what-if analysis and decision-making for optimal intervention planning. This is especially important for providing critical, time-sensitive decision support to leaders and decision-making teams, such as COVID-19 task forces, as they determine which NPIs to impose or lift over time. Efficiency in this decision-making process is important, as the space of all potential combinations and variations of NPIs is large and complex. The options for a particular region have varying degrees of impact on outcomes for that region. Tools [26] that enable what-if analysis and intervention planning at both national and sub-national levels, and that incorporate the WNTRAC dataset, can be leveraged to meet this need. For decision-makers, these tools enable easy and timely navigation through the complex intervention space to generate optimal and context-relevant COVID-19 intervention programs. A key requirement for such tools is an epidemiological model that is calibrated in such a way that the resulting forecasts can be trusted as accurate projections. To calibrate these models, it is critical to consider the NPIs that have been imposed so that the drivers of disease spread can be contextualized for a region. By incorporating NPIs into the models, improved projections of the outcomes of the disease can be generated, yielding more accurate scenarios for decision-makers to explore.
In addition to the above examples, the WNTRAC dataset can be used to support other objectives, including estimating the relationships between NPIs and
• consumer behavior by, for example, correlating retail data with NPIs.
• environmental changes such as pollution levels.
• actual compliance by the population. Naturally, not all the interventions recorded in the dataset are an accurate representation of reality, as some of the interventions capture a governmental request that might not be followed by the entire population. Thus, it might be useful to integrate the WNTRAC dataset with other publicly available data sources that can provide information regarding the level of compliance with an intervention, such as social media and mobility information [27,28], which we used in the NPI Index example above.
Lastly, one other interesting use case is to estimate the economic impact of NPIs by, for example, relating unemployment rates and jurisdictional debt to NPIs. Estimating the effect of NPIs on non-COVID-19 health problems, such as late cancer detection due to missed screening tests, will also be useful.

Code Availability
The source code for the WNTRAC automated NPI curation system, including the data processing pipeline, the WNTRAC Curator tool, and the NPI data browser, is available in a public GitHub repository at https://github.com/IBM/wntrac/tree/master/code, alongside the up-to-date version of the dataset at https://github.com/IBM/wntrac/tree/master/data. Please refer to the README file in the repository for further instructions on using the code.