OxCOVID19 Database, a multimodal data repository for better understanding the global impact of COVID-19

Oxford COVID-19 Database (OxCOVID19 Database) is a comprehensive source of information related to the COVID-19 pandemic. This relational database contains time-series data on epidemiology, government responses, mobility, weather and more across time and space for all countries at the national level, and for more than 50 countries at the regional level. It is curated from a variety of (wherever available) official sources. Its purpose is to facilitate the analysis of the spread of SARS-CoV-2 virus and to assess the effects of non-pharmaceutical interventions to reduce the impact of the pandemic. Our database is a freely available, daily updated tool that provides unified and granular information across geographical regions. Design type Data integration objective Measurement(s) Coronavirus infectious disease, viral epidemiology Technology type(s) Digital curation Factor types(s) Sample characteristic(s) Homo sapiens

Characterising the impact of the COVID-19 pandemic and understanding the efficacy of policy interventions requires a comprehensive, well-formatted and easily accessible database. The World Health Organisation and the European Centre for Disease Prevention and Control collect daily statistics about cases and deaths from governmental sources. These aggregated databases are widely used in research but they lack granularity and context. In addition, several academic institutions have curated high quality data sets aiming at capturing variables not included in the aforementioned databases: the Coronavirus Resource Center at John Hopkins University 1 , the Real-time Case Tracker 2 , The Economist's Tracker for COVID-19 Excess Deaths 3 and the Oxford COVID-19 Government Response Tracker 4 . In constrained research contexts related to the pandemic, these databases prove to be immensely useful to researchers and policy-makers seeking to understand both the causes of the spread and the efficacy of public health interventions. Linking such heterogeneous data is vital to understanding the context which gave rise to the observations and to making inferences at a finer spatial resolution. However, the process of linking relevant data across these sources is complex and requires great care. The OxCOVID19 Database aims to link different modalities of data, reported at the national and regional level, including epidemiological information on COVID-19 (confirmed, deaths, recovered, hospitalised, etc.), government response (school closing, economic measures, etc.), mobility (e.g., change in mobility trends of humans in various places), weather (e.g., temperature, humidity, precipitation, etc.), socioeconomic statistics www.nature.com/scientificreports/ and value surveys (Fig. 1). The database uses an established spatial index GID 5 , which spans several administrative levels. Wherever possible, the OxCOVID19 Database draws upon official government sources, work by university-based or government research groups and data from peer-reviewed scientific papers. The data are provided with the different granular spatial level thereby facilitating a better understanding of how regional characteristics inform the spread of the disease (e.g. Fig. 2). Well-linked and granular data of this type can enable the construction of more accurate models of the pandemic by allowing reliable estimation of the required parameters for relatively small regions, avoiding the process of averaging them on a country level. They also increase our understanding of the efficacy of various interventions at the state and regional levels. Thus, we hope this resource in combination with mathematical modelling and machine learning for data analytics will enhance our understanding of the COVID-19 pandemic and facilitate the development of strategies to reduce the impact on society. Some of the key questions which the OxCOVID19 Database can help to answer include the assessment of the effectiveness of different types of non-pharmaceutical interventions such as government lockdowns, mobility restrictions, and social distancing in reducing the spread of infections 6 .

Methods
Data sources. All the data available in the OxCOVID19 Database are collected from publicly available sources including scientific reports 7 , government press releases, briefings, and similar. For the epidemiological data we relied mostly on official government sources including websites and repositories from Ministries of Health, regional Public Health Authorities, university research groups and official social media accounts.  13 . A full and updated list of data sources is maintained on https:// github. com/ covid 19db/ data. There have been numerous challenges in assembling our OxCOVID19 Database. Since this is a "live" database, we had to build a system architecture allowing for daily fetching and validation of the data (see Fig. 3). The datasets used in our work often reported at different levels of geographical division. For example, in the UK, epidemiology was reported to Level 3 while mobility was reported to Level 1 or 2 depending on the source. This presents a substantial challenge when joining the tables. To overcome this problem we introduced a common key, the GID, described in detail in the next section. The sources often change the format, move their location or stop reporting, which present another challenge. To address this, we implemented an automated validation system to detect such issues and alert us when changes need to be made to the fetchers. Different sources report data at different spatial and temporal resolutions.. We provide information about each source used in our database both on the official project's GitHub (https:// github. com/ covid 19db/ data). At the time of publication we used 53 sources for the EPIDEMIOLOGY Unifying the data across geographical regions. The OxCOVID19 Database links multimodal data for different levels of administrative division. The largest administrative subdivision of a country will be called the  www.nature.com/scientificreports/  www.nature.com/scientificreports/ "first-level administrative division", "first administrative level", or "Level-1" (e.g., "states" in the USA or "voivodeships" in Poland). The next smaller regions will be described as the "second-level administrative division", "second administrative level" or "Level-2" (e.g., "counties" in the USA, "powiaty" in Poland); similarly "third-level administrative division", "third administrative level" or "Level-3" (e.g., "gminy" in Poland). Not every country has a third level (e.g., the USA), some countries do not even have a second level, but we include these where available.
We link data from multiple sources and various levels of administrative subdivision into one relational database using the GID from the Global Administrative Areas (GADM) database as a geographical identifier 5 . The goal of the GADM database is to "map the administrative areas of all countries, at all levels of sub-division". It is freely available for non-commercial use as is the case here. The GID identifies a geographic area with an alphanumeric string. For example, the string 'CHN. 16.4_1' can be decoded as follows: the first three letters are the ISO 3166-1 alpha-3 country code for Mainland China; the '16' indicates the Level-1 subdivision, here the province of Jiangxi; the '4' represents the Level-2 division, here Jingdezhen City; finally, the '_1' indicates a version number for the GID, which only changes in the event of major internal reorganisation and allows for backwards compatibility. To each GID a polygon is associated giving the boundaries of the geographical area to an extremely fine resolution. A dedicated table named ADMINISTRATIVE_DIVISIONS contains the GIDs together with their place names and locations, expressed both as a single point and as a polygon. The resolution of the polygon is reduced from that given in GADM in order to conserve space. Every record, insofar as possible, is matched to a GID or a list of GIDs. This allows the user to match different modalities of data together using GIDs, including across hierarchies. While GID could potentially be used as the main geographical key in our database, we have chosen to introduce some redundancy and use names (as standardised in GADM 5 ) of the regions along with their GIDs for ease of use as well as to permit the exceptional absence of GIDs.
Typically, assigning a GID to the region referred to in a record is a straightforward matter. Slight inconsistencies with spelling variations, prefixes, and suffixes can sometimes be an obstacle to carrying out a direct text match, but this requires only limited and obvious manual adjustment. Some records lack the geographical specificity needed to assign a GID, such as where the administrative subdivision is listed as "Unknown". There are, however, more challenging situations, often relating to administrative reorganisations which have taken place since the release of GADM 3.6. These are best demonstrated by means of example.

Handling of boundary changes: some examples.
(i) In UK-England, some local authorities (Level-2 units) have undergone boundary changes, with Level-3 units being moved between or promoted to Level-2 units. Each local authority is therefore easily expressed as a list of Level-2 or Level-3 units. Epidemiology. While our goal is to collect epidemiological data on the regional level for as many countries as possible, we initially sought to prioritise countries to include. To determine priority levels we incorporated three criteria: total population, air traffic volume, and number of COVID-19 related deaths. All countries were ordered according to each of the three criteria on 5 May 2020 and the ranks of countries with respect to each criterion are added to give the priority score (i.e., we used a Borda count). The top 20 countries according to this rank at the time of prioritisation were: United States, China, India, Brazil, United Kingdom, Indonesia, Germany, Turkey, Japan, Spain, Ireland, Russian Federation, France, Italy, Mexico, Pakistan, Belgium, Canada, Iran, Nigeria. We have successfully included regional data for all but Turkey and Iran in the database. At the time of writing, 41 countries have been included at level-1, of which 6 countries are present at level-2, with the United Kingdom at level-3.  10 . The variables provide information about temperature, sunshine, humidity and precipitation. This information is sampled daily and reported on a 12 km × 12 km uniform latitudelongitude grid.
To provide these data in a manner which permits linking with the other tables, we report the mean value for each variable across all grid points contained in Level-1 and Level-2 GADM subdivisions, along with the standard deviation and the number of grid points in the region. We report on a daily basis starting from 1 January 2020.
This level of subdivision was chosen on the basis that almost all Level-2 regions contain a grid point. Where the region contains no grid point, no record is created. This, however, happens in fewer than 0.5% of the cases. Using Level-3 instead of Level-2 would result in a large number of missing records, while using Level-1 would be overly coarse leading to high standard deviations and reduced explanatory power. Further, values for larger geographical units can be obtained by the user by averaging over the smaller subdivisions taking into account the number of points in each region.
World Bank data. The World Bank Development Indicators dataset 13  The statistics are aggregated and equipped with the appropriate GID both at the country level and at a regional level where possible. These regions are generally larger than GADM Level-1 and included only for the same 20 countries which were prioritised for epidemiological data.
Our Integrated Values Surveys dataset is obtained by merging together all fully released waves of the World Values Survey and the European Values Study. There is no official release of this integrated dataset-we merged it following the official guidelines 14 making appropriate adjustments where the guidance has not provided the correct matching.
For each survey question, we report the frequency of each answer. Because each possible answer generates a column, the resulting table has more than 15,000 columns. To reduce the size of the table we instead stored all the statistics for each country/region in a nested dictionary, placed in the column "properties" in the SURVEYS table.

Data records
The database is available to download at https:// covid 19. eng. ox. ac. uk/. The data are stored in a PostgreSQL database. CSV extracts from this database are available to access at https:// github. com/ covid 19db/ data. Administrative divisions. The ADMINISTRATIVE_DIVISIONS table (see Table 1) contains the geographic features and information associated with each GID, extracted from GADM 5 . It includes six linking columns (K3)-(K8), followed by countrycode_alpha2, the ISO 3166-1 alpha-2 country code, adm_level, specifying which level of division it is, adm_area_1_code, adm_area_2_code and adm_area_3_code, providing the GID for each higher level administrative division, properties, which includes alternative names and identification codes and three geometric features: latitude and longitude, specifying the centroid of the region, and geometry, specifying the simplified boundaries of the region (shapefiles) for mapping purposes.  Table 2).  Table 3), prepared and curated by researchers from the Blavatnik School of Government, University of Oxford 4 . These indicators are grouped into the following categories: containment and closure, economic response and health systems, and miscellaneous policy announcements that do not fit anywhere else.
Mobility data. The MOBILITY table (see Table 4) includes all eight linking columns, (K1)-(K8), followed by a number of indicators of human mobility as reported by Google 8 . These data are derived from aggregated movements of Android phone users and are stratified by the location of the user: place of work, outdoor parks, recreation areas, grocery markets etc. This table also contains the change in traffic volume reported by Apple 9 from aggregated tracking of iPhone users of people walking, driving or taking public transit in their communities. Google measures mobility on any day relative to the median value for each of the five days falling on the same day of the week in the period January 3-February 6, 2020, while Apple measures all data relative to January 13, 2020. The data only describe mobility within particular locations for particular activities. They do not indicate the amount of travel between regions nor do they contain individual-level data.
Weather data. The WEATHER table (see Table 5) includes all eight linking columns (K1)-(K8) followed by 47 variables including temperature, sunshine, precipitation, air temperature, wind speed etc.     Table 7 Surveys. The SURVEYS table (see Table 6) includes seven of the eight linking columns (K1), (K3)-(K8) followed by samplesize indicating the number of people taking part in the survey for the region under consideration, properties, which is a dictionary containing the region/country statistics and wave, specifying the particular survey being reported.

Technical validation
The code used to build the OxCOVID19 Database was developed collaboratively. Working across several GitHub repositories (https:// github. com/ covid 19db) allowed us to share documentation and keep code organised and up to date. We encourage the research community to report any issues they find. Figure 3 shows the system architecture that is being used to collect, unify, store and share the data. We operate more than 70 fetchers to periodically obtain raw data from our sources. This automated process ensures that we collect the most recent data and reduces potential error due to manual entry. The "Unification" step ensures that the names in different tables in the OxCOVID19 Database are consistent across geographical regions. In the "Validation" step a check for consistency is performed. During the storing step, the last timestamp in input data is compared with the current time and if the inserted data are older than 14 days relevant warnings are generated as that may indicate the change in format of the fetched data or some other problem that needs to be fixed. The fetching process is triggered twice a day at 02:00 and 14:00 BST. The sharing process, namely publishing existing data sources to CSV files hosted on GitHub, is triggered four times a day. Citation advice. The OxCOVID19 Database is the results of many hours of volunteer efforts and generous contributions from many organisations listed in the Methods section under "Data Sources". We encourage the users of OxCOVID19 Database to cite, along with this article, the underlying sources. www.nature.com/scientificreports/ www.nature.com/scientificreports/

Code availability
The code for data acquisition and cleaning used in the processing of assembling the OxCOVID19 Database is on the GitHub repository: https:// github. com/ covid 19db.