Background & Summary

The COVID-19 pandemic represents an unprecedented health crisis. Timely, high-quality, publicly accessible COVID-19 data have been essential for monitoring its impact. National authorities started collecting data, often through their health surveillance systems or by establishing ad hoc data collection systems after the start of the epidemic. Statistics on the number of COVID-19 cases, hospitalizations, intensive care admissions, and deaths, among other figures, quickly became available, often broken down at different geographical levels and for specific population characteristics. These data have been extensively recirculated, collated, commented on, and compared; however, they are usually poorly sourced and documented. Although there are other international databases of deaths related to COVID-19, some of which cover many countries (e.g. those compiled by the World Health Organization1, Our World in Data2, and Johns Hopkins University3), they all have limitations, mostly related to the quality and degree of detail of the metadata and the lack of user-friendly ways to download time-series data. Because data on deaths attributable to COVID-19 are mainly imperfect statistics, the lack of documentation limits accurate comparisons.

“The Demography of COVID-19 Deaths” database4 focuses on collecting death counts attributable to COVID-19, providing details on the exact definition of a COVID-19 death used in each country and on the processing of the data. The database specifically aims to provide tools for assessing data coverage and comparability over time and across countries, for use by the research community and others (policymakers, journalists, etc.) in accurate trend analysis. Related documentation is collected from official statistics bureaus and epidemiological surveillance agencies and published in the database in order to:

  1. Identify the exact source of data and its main features, such as the type of data collection system (e.g. health surveillance system, vital statistics, hospital records, etc.), collection and publication patterns, and quality control protocols.

  2. Describe the conditions for reporting deaths from COVID-19 and assess the degree of completeness of the statistical information, as well as the changes over time in each country (e.g. whether the reported deaths only include those occurring in hospital settings or also those occurring in nursing homes or private residences).

  3. Illustrate possible sources of misinterpretation, by drawing attention to the reference date for which deaths are reported and their possible reporting delays. Depending on the country, deaths may be reported at the time of occurrence, at the time of registration, at the time the information is entered into the statistical system, or at the time when the figures are officially published.

  4. Specify the criteria considered in the attribution of a death to COVID-19 and its confirmation mechanisms. The criteria may vary among countries or between one data source and another within the same country. Some data sources only include confirmed cases, e.g. on the basis of biological tests or clinical diagnosis, while others also include suspected cases based on symptoms or proximity to a known case.

The database provides cumulative death counts by sex and age. These demographic characteristics are important determinants of COVID-19 mortality: older people are more vulnerable to the infection and its most severe forms, and differences between men and women have also been well documented5. As such, variations in population structure are expected to affect the number of deaths and must be taken into account in international comparisons of COVID-19 mortality. The database also provides the most up-to-date population counts in each country. Information on the place and date of occurrence of the deaths was added where available.
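One common way to take population structure into account is direct age standardization, in which age-specific death rates are weighted by a shared standard population. The sketch below is illustrative only: the age groups, death counts, populations, and standard population are invented, and the function names are hypothetical rather than part of the database or its R code. The two invented countries have identical age-specific rates but different age structures, so their crude rates differ while their standardized rates coincide.

```python
from typing import Dict


def crude_rate(deaths: Dict[str, float], population: Dict[str, float]) -> float:
    """Deaths per 100,000 population, all age groups combined."""
    return 100_000 * sum(deaths.values()) / sum(population.values())


def directly_standardized_rate(
    deaths: Dict[str, float],
    population: Dict[str, float],
    standard: Dict[str, float],
) -> float:
    """Age-specific rates weighted by the standard population shares, per 100,000."""
    total_standard = sum(standard.values())
    rate = 0.0
    for age_group, standard_pop in standard.items():
        age_specific_rate = deaths[age_group] / population[age_group]
        rate += age_specific_rate * standard_pop / total_standard
    return 100_000 * rate


# Invented figures for illustration only (not taken from the database).
standard = {"0-59": 750_000, "60-79": 200_000, "80+": 50_000}
country_a = {
    "deaths": {"0-59": 80, "60-79": 300, "80+": 500},
    "population": {"0-59": 800_000, "60-79": 150_000, "80+": 50_000},
}
country_b = {
    "deaths": {"0-59": 70, "60-79": 400, "80+": 1000},
    "population": {"0-59": 700_000, "60-79": 200_000, "80+": 100_000},
}

for label, country in (("A", country_a), ("B", country_b)):
    crude = crude_rate(country["deaths"], country["population"])
    dsr = directly_standardized_rate(country["deaths"], country["population"], standard)
    print(label, round(crude, 1), round(dsr, 1))
```

The choice of standard population affects the level of the standardized rates, so the same standard should be used for every country being compared.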

Data collection started at the beginning of the pandemic. The database was launched on 2 April 2020. As of mid-December 2021, it covers 21 countries: Austria, Belgium, Canada, England and Wales, Denmark, France, Germany, Italy, Japan, the Netherlands, Norway, Portugal, the Republic of Korea, the Republic of Moldova, Romania, Scotland, Spain, Sweden, Switzerland, Ukraine, and the United States of America. An online survey on the database use revealed visits from researchers, students and journalists, with more than 660,000 views up to mid-December 2021.

The database was created and is distributed by the French Institute for Demographic Studies (INED by its French acronym), and will be maintained online even beyond the end of the pandemic. INED has extensive experience in maintaining demographic databases, through its involvement in databases such as the Human Mortality Database (mortality.org).

Methods

The operational team of “The Demography of COVID-19 Deaths” database selected the 21 countries in the database based on the availability of metadata and of periodic data publications. Each member of the operational team collects death counts and the related documentation for up to two countries. These country specialists monitor data availability daily and collect all the information relating to COVID-19 mortality (statistical data and documentation) made available by national statistics offices and health institutions since the beginning of the pandemic. This implies that, for each country, data are added retrospectively according to their reference date (when possible). Country specialists prepare country data files in pre-established standard formats in Excel spreadsheets and update the country-specific documentation. For some countries, data files can be formatted with statistical software because they are provided in .csv, .txt or .xlsx format. For other countries this is not possible, because the data come as .pdf files, screenshots of online dashboards, or images. For these countries, data must be copied “by hand” into the standard Excel spreadsheet.

Data for each country included in the database are accompanied by explanatory notes providing detailed explanations about data heterogeneity. These notes are structured in six sections and describe:

  1. The data source(s), the specific institution and official website(s);

  2. Data coverage (i.e. whether the data include only hospital deaths, all deaths, or deaths which occurred elsewhere, and, if available, a description of the cause-of-death certification process);

  3. Data collection methods (i.e. information on the national protocols for reporting COVID-19 deaths);

  4. The type of information originally available for each death (e.g. sex, age, date and place of occurrence, geography, comorbidity);

  5. Publication frequency and data cut-off time (i.e. the time at which the death count is stopped before publication by the organization in charge of disseminating the data);

  6. A summary of any changes since the beginning of the pandemic in the criteria for the attribution of a given death to COVID-19 and in the data collection methods. The implementation dates of each of these changes are also indicated.

Until the beginning of July 2020, the data series were updated daily. Since then, updates have been carried out weekly, as the pace of publication by the national statistics offices slowed down during the summer of 2020.

The operational team completes the data update procedure by preparing and releasing the pooled data and metadata sets (i.e., files containing data for all countries combined). During this process the information collected for the countries is harmonized into categories that can be compared across data sources. Table 1 is an extract of the codebook of variables that describe the definition of a COVID-19 death for each data source in the database. The complete codebook can be downloaded at https://dc-covid.site.ined.fr/en/data/pooled-datafiles/ and in figshare6.

Table 1 Data definition variables.

Data Records

“The Demography of COVID-19 Deaths” is freely available online [https://dc-covid.site.ined.fr/en/] for scientific use (see “Terms of use” page on the website: https://dc-covid.site.ined.fr/en/contact-terms-use/). A snapshot of the dataset is also available at the figshare repository6. The database aims at addressing the needs and interests of a variety of database users, including researchers as well as non-specialists and the general public. The database consists of a collection of webpages: one for each country, one for the pooled datasets, and additional webpages including a discussion of key data issues.

In the country-specific webpages [https://dc-covid.site.ined.fr/en/data/], for each country, the following files are available for download: 1) a spreadsheet file with the actual data coming from one or (if applicable) several data sources, also containing a summary of the metadata on each data source, 2) a document containing country-specific explanatory notes, and 3) all the original documentation (methodological documents and official reports) from which the data and the metadata have been extracted.

The original documentation can be used to specify data characteristics, data heterogeneity, and possible biases across different sources, for rigorous international comparisons. Each country-specific webpage also displays some summary information about coverage and the national sources of information, as well as a link to the archive containing all prior versions of the data files (Fig. 1). The country-specific webpage is intended for non-specialist users. It provides data in a user-friendly format along with warnings about data idiosyncrasies that may hamper country comparison.

Fig. 1 Screenshot of a typical country page.

On the Pooled sets webpage [https://dc-covid.site.ined.fr/en/data/pooled-datafiles/], pooled data sets are available for expert users in .csv format. These data sets include all the data sources except for those that have been discontinued. They can be downloaded in a zipped folder along with a file containing population estimates, the pooled metadata set, and a codebook of variables. Countries in the pooled data sets are referenced with the ISO 3166 numeric and alpha-3 codes, which help to geolocate the data and to merge them with indicators available in other international databases, such as testing, short-term estimates of excess mortality, vaccination, and policy indicators.
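As a rough illustration of how the ISO 3166 codes support such merges, the sketch below joins a small death-count extract with an external indicator table on the alpha-3 code. The column names (`iso3`, `iso_numeric`, `cum_deaths`, `tests_per_thousand`) and all values except the ISO codes themselves are hypothetical; the actual variable names are given in the codebook distributed with the pooled files.

```python
import csv
import io

# Hypothetical extracts: column names and figures are invented for illustration.
DEATHS_CSV = """iso3,iso_numeric,country,cum_deaths
FRA,250,France,1000
ITA,380,Italy,1200
UKR,804,Ukraine,300
"""

INDICATOR_CSV = """iso3,tests_per_thousand
FRA,15.2
ITA,18.9
DEU,20.1
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def join_on_iso3(left, right, key="iso3"):
    """Inner join on the alpha-3 code; also report left-hand codes with no match."""
    lookup = {row[key]: row for row in right}
    merged, unmatched = [], []
    for row in left:
        match = lookup.get(row[key])
        if match is None:
            unmatched.append(row[key])
            continue
        merged.append({**row, **match})
    return merged, unmatched


merged, unmatched = join_on_iso3(read_rows(DEATHS_CSV), read_rows(INDICATOR_CSV))
for row in merged:
    print(row["iso3"], row["cum_deaths"], row["tests_per_thousand"])
print("no indicator available for:", unmatched)
```

Reporting the unmatched codes makes it visible when a country present in one source is silently missing from the other.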

The following zipped folders are available:

  1) Cumulative deaths by sex and age [https://www.ined.fr/fichier/rte/166/Page%20Data/Pooled%20Datasets/AgeSex.zip],

  2) Cumulative deaths by places of death [https://www.ined.fr/fichier/rte/166/Page%20Data/Pooled%20Datasets/PlaceOfDeaths.zip],

  3) Cumulative deaths by publication dates [https://www.ined.fr/fichier/rte/166/Page%20Data/Pooled%20Datasets/PublicationDate.zip],

  4) Cumulative deaths by occurrence date of the deaths [https://www.ined.fr/fichier/rte/166/Page%20Data/Pooled%20Datasets/OccurenceDate2021.zip].

The pooled data sets can be downloaded from the website, including previous versions of the data files. A version updated as of mid-December 2021 is also available in figshare6, along with R code that can be used to retrieve the latest version of these datasets from the website.
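The official retrieval code is the R script in figshare. For readers working in Python, a minimal sketch of the same idea is shown below: download one of the zipped folders and read every .csv file it contains. The file names inside the archive, the text encoding, and the comma delimiter are assumptions, and `make_demo_zip` builds an invented in-memory archive so that the parsing step can be tried without network access.

```python
import csv
import io
import zipfile
from urllib.request import urlopen

# URL of the sex-and-age pooled folder, as listed in the text.
AGE_SEX_ZIP_URL = (
    "https://www.ined.fr/fichier/rte/166/Page%20Data/Pooled%20Datasets/AgeSex.zip"
)


def download_zip(url=AGE_SEX_ZIP_URL, timeout=30):
    """Fetch a zipped folder and return its raw bytes (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def read_csvs_from_zip(zip_bytes):
    """Return {file name: list of row dicts} for every .csv file in the archive.

    Assumes UTF-8 text and comma-separated columns; adjust if the files differ.
    """
    tables = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip codebooks, notes, and other non-CSV members
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                tables[name] = list(csv.DictReader(text))
    return tables


def make_demo_zip():
    """Build a tiny invented archive that stands in for a downloaded folder."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("AgeSex_demo.csv", "iso3,age_group,cum_deaths\nFRA,80+,10\n")
        archive.writestr("readme_demo.txt", "not a data file")
    return buffer.getvalue()


tables = read_csvs_from_zip(make_demo_zip())
print(sorted(tables))
```

With network access, `read_csvs_from_zip(download_zip())` would apply the same parsing to the real folder.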

The pooled metadata set summarizes the main characteristics of the data needed for proper international comparisons, such as the definition of deaths attributable to COVID-19, confirmation criteria, data coverage, covariates, and the collection and publication patterns of each data source (Tables 1 and 2). Metadata for the population estimates are also available. Through the R code available in figshare6, users can merge the pooled data and metadata files to keep track simultaneously of changes in the data and in the definitions over time. The merged files can then be used to select comparable subsets according to data characteristics.
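The merge itself is provided by the R code in figshare. The Python sketch below only illustrates the underlying idea under simplifying assumptions: each metadata row is taken to apply from a `valid_from` date until the next change for the same source, and each data row receives the definition in force on its date. The field names (`source`, `date`, `valid_from`, `definition`) and all values are hypothetical and do not reproduce the structure of the pooled files.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date

# Invented rows for illustration only.
data_rows = [
    {"source": "A", "date": date(2020, 5, 15), "cum_deaths": 100},
    {"source": "A", "date": date(2020, 6, 15), "cum_deaths": 150},
    {"source": "B", "date": date(2020, 2, 15), "cum_deaths": 5},
    {"source": "B", "date": date(2020, 6, 15), "cum_deaths": 90},
]
metadata_rows = [
    {"source": "A", "valid_from": date(2020, 4, 1), "definition": "confirmed"},
    {"source": "A", "valid_from": date(2020, 6, 1), "definition": "confirmed+suspected"},
    {"source": "B", "valid_from": date(2020, 3, 1), "definition": "confirmed"},
]


def attach_metadata(data, metadata):
    """Attach to each data row the definition in force on its date (None if unknown)."""
    by_source = defaultdict(list)
    for meta in metadata:
        by_source[meta["source"]].append(meta)
    for metas in by_source.values():
        metas.sort(key=lambda m: m["valid_from"])

    merged = []
    for row in data:
        metas = by_source.get(row["source"], [])
        starts = [m["valid_from"] for m in metas]
        position = bisect_right(starts, row["date"])  # number of changes on or before the date
        definition = metas[position - 1]["definition"] if position else None
        merged.append({**row, "definition": definition})
    return merged


merged = attach_metadata(data_rows, metadata_rows)
# Keep only the rows collected under the same attribution criterion.
comparable = [row for row in merged if row["definition"] == "confirmed"]
for row in comparable:
    print(row["source"], row["date"].isoformat(), row["cum_deaths"])
```

Rows dated before the first documented definition are kept with `None` rather than dropped, so that gaps in the documentation stay visible.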

Table 2 Description of variables contained in the pooled metadata set and in the online Data Availability Explorer about the death counts and data collection methods.

Users can explore the data availability in the database according to metadata characteristics through an online Data Availability Explorer [https://ineddemographiecovid19.shinyapps.io/DataViz/]. This tool shows the heterogeneity of the data across data sources and over time, for example in the width of age groups, data collection systems, data coverage, identification of COVID-19 (suspected or confirmed, cause of death), and reference date (e.g., registration, publication, or occurrence date). It can be used to identify data sources with similar characteristics, which can therefore be compared. Because some data sources have changed their characteristics over time, sound comparisons may be limited to specific periods. The explorer is updated monthly.

Technical Validation

Collected data undergo a number of validations before and after publication. To ensure that both the data and the associated documentation are of the highest quality, an in-country expert is contacted to contextualize and interpret the information before the data for a new country are published in the database. The national expert provides feedback about the most reliable data sources and checks the accuracy of the explanatory notes prepared by the database team. In addition, three team members oversee the work carried out by the country specialists before the publication of each country update. They check the accuracy of the daily data against the original data sources and monitor the completeness of the documentation. This is especially important for those countries for which the data have to be copied manually because they are only available in formats that cannot be reused in statistical software (e.g. .pdf files, online dashboards, or images).
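The text does not specify which automated checks, if any, support this manual review. One simple check that users can run on any cumulative series, sketched here as an assumption rather than as the team's procedure, is to flag dates on which the cumulative count decreases. Such decreases may reveal a transcription error, although they can also reflect a genuine downward revision by the source. The series below is invented.

```python
def find_decreases(series):
    """Return (previous_date, date, previous_value, value) wherever the count drops.

    `series` is a list of (ISO date string, cumulative count) pairs in any order.
    """
    ordered = sorted(series)  # ISO date strings sort chronologically
    issues = []
    for (prev_date, prev_value), (cur_date, cur_value) in zip(ordered, ordered[1:]):
        if cur_value < prev_value:
            issues.append((prev_date, cur_date, prev_value, cur_value))
    return issues


# Invented cumulative series with one suspicious entry on 2020-04-04.
demo_series = [
    ("2020-04-01", 120),
    ("2020-04-02", 135),
    ("2020-04-03", 153),
    ("2020-04-04", 135),
    ("2020-04-05", 170),
]

issues = find_decreases(demo_series)
for prev_date, cur_date, prev_value, cur_value in issues:
    print(f"{cur_date}: {cur_value} is below {prev_value} reported on {prev_date}")
```

Flagged dates would still need to be compared with the original publication before deciding whether the value is an error or a revision.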

The database team also performs other types of validation through data analyses. The first is a graphical representation of the weekly standardized death rates since the beginning of the pandemic in each country. These graphs are available online for illustrative purposes [https://www.ined.fr/fichier/rte/166/Page%20accueil/Taux-eng.jpg] and can be used to review the long-term trends of the pandemic. In addition, the team analyzed the COVID-19 data-related issues that may hinder international comparisons in a published scientific paper5. During this analysis, the operational team conducted comparisons with other datasets, such as the Human Mortality Database (HMD). Specifically, the age-and-sex proportional distributions of the cumulative number of COVID-19 deaths that occurred during the first wave (up to September 15, 2020) were assessed against the age-and-sex distributions of previous all-cause mortality reported by the HMD. Further analysis considered comparisons among countries based on classic demographic indicators such as sex ratios of age-standardized rates and age-mortality patterns. The database has also been used by international researchers and has already led to articles published in influential journals such as The Lancet7, Proceedings of the National Academy of Sciences8,9, and Demographic Research10.
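Because the database stores cumulative counts, weekly figures such as those underlying the weekly rates mentioned above have to be derived by differencing. The sketch below shows one way to do this, assuming a dictionary of cumulative counts keyed by date; it is not the team's code, and the figures are invented. Weeks for which either endpoint is missing are skipped rather than interpolated.

```python
from datetime import date, timedelta
from typing import Dict


def weekly_new_deaths(cumulative: Dict[date, int], start: date, end: date) -> Dict[date, int]:
    """Deaths in each 7-day window ending on start+7, start+14, ... up to `end`."""
    weekly = {}
    week_end = start + timedelta(days=7)
    while week_end <= end:
        week_start = week_end - timedelta(days=7)
        if week_end in cumulative and week_start in cumulative:
            weekly[week_end] = cumulative[week_end] - cumulative[week_start]
        week_end += timedelta(days=7)
    return weekly


# Invented cumulative counts; no figure was published for 22 April.
cumulative_demo = {
    date(2020, 4, 1): 100,
    date(2020, 4, 8): 180,
    date(2020, 4, 15): 230,
    date(2020, 4, 29): 300,
}

weekly_demo = weekly_new_deaths(cumulative_demo, date(2020, 4, 1), date(2020, 4, 29))
for week_end, deaths in weekly_demo.items():
    print(week_end.isoformat(), deaths)
```

Dividing each weekly count by the relevant population, age group by age group, would then give the weekly age-specific rates needed for standardization.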

The stringency of the data collection and metadata review processes has limited the number of countries available in the database. In some cases, there may be no country specialist and/or in-country expert in the database operational team to ensure that the data available are properly interpreted. In other cases, national statistics offices in charge of disseminating COVID-19 mortality data do not provide documentation that is detailed enough to assess the quality and comparability of the available statistical information. These are important criteria for a country to be included in the database.

Usage Notes

“The Demography of COVID-19 Deaths” database is an open access database [https://dc-covid.site.ined.fr/en/]. No registration is required. Anyone can browse the database content and download the data, which are available both as spreadsheets and .csv files. Specific citation guidelines are indicated in a “Terms of use” page on the website. Users are invited to contact the database team at eo-dc-covid@listes.ined.fr with their questions or suggestions, or to volunteer their help with accessing data from additional countries.

In addition to providing data to users, “The Demography of COVID-19 Deaths” database highlights the potential uses that may be made of the data, while pointing out the specificities, differences and shortcomings of the data sets [https://dc-covid.site.ined.fr/en/presentation]. In this sense, a core feature of the database is to illustrate critical data issues to be considered when conducting comparisons over time and/or across countries. Simple cross-country examples are given to illustrate these issues and demonstrate the need to take them into consideration when analyzing pandemic mortality data.

Users are also invited to check the Data Availability Explorer before engaging in international comparisons. The database team’s first analysis pointed out the need to take data heterogeneity into account when identifying the national and transnational characteristics of COVID-19 mortality rates and trends5. The availability and degree of detail of the COVID-19 death counts by age and sex vary between data sources. Variations range from differences in the open-ended age interval (the maximum age to which deaths are reported) and the age-group intervals (Fig. 2), to the diversity of the mechanisms implemented for confirming COVID-19 infections or attributing a death to COVID-19 (Fig. 3), all of which may bias comparisons. Regarding the latter point, while most data sources rely on confirmation of COVID-19 through laboratory testing or clinical symptoms, some accept more loosely defined criteria and include suspected as well as confirmed COVID-19 deaths. Users can harmonize age-group intervals at their convenience; however, the other sources of heterogeneity in the data cannot be corrected after the fact. To correctly interpret trends and international differences in COVID-19 mortality, therefore, comparisons should only be carried out between sources that use comparable COVID-19 attribution and confirmation methods5.
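Harmonizing age groups usually means aggregating each source into the broadest intervals that all sources share, including a common open-ended interval. The sketch below illustrates this with invented counts: age groups are represented as (lower age, upper age) pairs, with `None` as the upper bound of the open-ended group, and a source group is only accepted if it falls entirely within one target group. The representation and function name are hypothetical choices, not part of the database.

```python
from typing import Dict, List, Optional, Tuple

AgeGroup = Tuple[int, Optional[int]]  # (lower age, upper age or None if open-ended)


def harmonize(counts: Dict[AgeGroup, int], lower_bounds: List[int]) -> Dict[AgeGroup, int]:
    """Aggregate counts into target groups defined by their lower bounds.

    Raises ValueError if a source group straddles two target groups, since its
    deaths cannot be split without further assumptions.
    """
    bounds = sorted(lower_bounds)
    targets: List[AgeGroup] = []
    for i, low in enumerate(bounds):
        high = bounds[i + 1] - 1 if i + 1 < len(bounds) else None
        targets.append((low, high))

    result = {target: 0 for target in targets}
    for (low, high), deaths in counts.items():
        for t_low, t_high in targets:
            starts_inside = low >= t_low
            ends_inside = t_high is None or (high is not None and high <= t_high)
            if starts_inside and ends_inside:
                result[(t_low, t_high)] += deaths
                break
        else:
            raise ValueError(f"age group {(low, high)} does not fit the target groups")
    return result


# Invented counts: source A reports up to 90+, source B only up to 80+.
source_a = {(0, 59): 40, (60, 69): 110, (70, 79): 250, (80, 89): 400, (90, None): 200}
source_b = {(0, 49): 10, (50, 59): 25, (60, 79): 300, (80, None): 500}

common_a = harmonize(source_a, [0, 60, 80])
common_b = harmonize(source_b, [0, 60, 80])
print(common_a)
print(common_b)
```

Raising an error for straddling groups, instead of splitting them proportionally, keeps the result free of modelling assumptions; users who need finer groups would have to apply an explicit redistribution method.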

Fig. 2 COVID-19 data availability by sex and age according to the open age interval and the width of the age groups, by data source. This figure was created with the Data Availability Explorer [https://ineddemographiecovid19.shinyapps.io/DataViz/]. Only one date per month is shown, but the same figure for all the available dates can be visualized online.

Fig. 3 COVID-19 data availability by sex and age according to the attribution criteria and confirmation mechanism in use by each source. This figure was created with the Data Availability Explorer [https://ineddemographiecovid19.shinyapps.io/DataViz/]. Only one date per month is shown, but the same figure for all the available dates can be visualized online.

Similarly, data sources vary according to the type of collection system, though surveillance systems (i.e. systems put in place by governments in urgent situations) are the most common. The type of data collection system affects the process of data preparation and updating, as well as the frequency with which data are released. Surveillance systems and systems based on the transmission of information from local authorities or hospitals to a centralized body are often associated with the daily release of updated data. Systems based on vital statistics, on the other hand, tend to release data less frequently because the data undergo verification protocols and corrections. These differences in data collection system types and data release procedures may affect the degree of completeness of the data: the faster the data are released, the less accurate they are likely to be.

The “Demography of COVID-19 Deaths” database is the only database so far to collect, update, and systematize the metadata associated with the mortality attributed to COVID-19. Detailed metadata enable users to make informed decisions regarding the most suitable data for their comparative analyses and to understand the main limitations of the results. This is a major strength of the database because most reported COVID-19 statistics come from ad hoc systems. Its user-friendly structure helps users with varying levels of expertise to find the information collected at any time, as it provides continuous access to previously released files and official documentation for each country.