Introduction

The key parameter to understand and model the evolution of the COVID-19 pandemic is the number of daily infections. Despite its importance, however, it is difficult to estimate reliable data due to the bias in official figures1. For symptomatic patients, there are delays of 5.79 days from infection to symptom onset and 5.82 days from symptom onset to diagnosis2. In addition, the presence of asymptomatic people and those with mild symptoms hinder their detection. For example, 86% of infections were undetected in Wuhan, China, prior to 23 January 2020, when travel restrictions were imposed3. Additionally, not all patients with COVID-19 compatible symptoms are tested, especially at the beginning of the pandemic. So, it is widely assumed that the reported infections are just a fraction of the actual ones. In Spain, a National Seroprevalence Study based on ~ 55,000 random participants estimated that only 10.7% (CI 95% 10.1%-11.3%) of the actual infections had been detected during the first wave (before 22 June 2020)4,5.

Another problem is the inconsistency of daily infection time series and the variability of the case definitions. For example, on 13 February 2020, China changed the parameters to confirm new cases and ~ 15,000 new infections were counted in a single day, while daily infections during the previous week were less than 35006. In Spain, the method of counting official numbers of infections has been modified several times since the pandemic was decreed. On 22 April 2020, a total of 213,024 cases were reported from positive PCR and antibody tests7; however, total cases decreased to 202,990 the next day and thereafter, because only positive PCRs were included in the total cases8. Accumulated cases on a given day are obtained by adding the new cases that day to the accumulated cases of the previous day. That has not always been the method, however, because some Spanish regions reviewed and modified the cases in previous days9,10. This inconsistency of time series hampers any kind of accurate analysis. Additionally, data are continuously being updated. In Spain, official daily infection data available during the first wave reported 1832 new infected on 14 March 2020, but current official data (as downloaded in February 2021) reported 7478 new cases for the same date.

High fidelity time series of each parameter of an epidemic are crucial to run reliable epidemiological models. The number of new infections per unit time is one of the essential parameters. The official data do not reflect the actual day of infection nor the real number of infections. To overcome this limitation, we propose a retrospective methodology to infer daily infections from daily deaths, believing the number of deaths to be a more reliable parameter than the official number of daily infections1. These estimated infections are closer to the date of infection, avoiding delays from the incubation period and symptom onset to diagnosis. Then, the time series of daily infections would help to better understand the pandemic dynamics and to quantify the effectiveness of the different control strategies, avoiding the bias of the official data. In addition, those numbers can be used to feed epidemiological models with more realistic data. Finally, we will apply this methodology in Spain as a whole and each of its 19 administrative regions (17 autonomous communities and 2 autonomous cities).

Methods

REMEDID

We named this methodology REMEDID, REtrospective Methodology to Estimate Daily Infections from Deaths and it is described as follows. Date of infection (DI) of an individual may be estimated from the death date (DD) by subtracting the incubation period (IP) and illness onset to death (IOD) periods, hence

$$DI \, + \, IP \, + \, IOD \, = \, DD.$$

Therefore, DI can be estimated from DD as far as IP and DD are known; however, neither IP nor IOD are fixed values. On the contrary, they are random variables that can be approximated by probability distributions. From dozens of cases in Wuhan, Linton et al.11 approximated IP for COVID-19 with a lognormal distribution, XIP, with a mean of 5.6 days and median of 5 days, and IOD with a lognormal distribution, XIOD, with a mean of 14.5 days and median of 13.2 days. Therefore, the infection to death period is a random variable, XIP+IOD, that follows the distribution XIP + XIOD.

Although the addition of two lognormal distributions does not follow any commonly used probability distribution, its probability density function (PDF) can be estimated convolving the PDF of the two variables12. If g(t) and h(t) are the PDF of IP and IOD, respectively, then their convolution defines the PDF of XIP+IOD,

$$f\left({t}_{0}\right)=\underset{-\infty }{\overset{+\infty }{\int }}g\left({t}_{0}-t\right)\cdot h\left(t\right) dt,$$

where t0 is a positive real number representing days from infection. Then, f(t) is the PDF of the time from infection to death, with a mean of 20.1 days and a median of 18.8 days. Figure 1 shows g(t), h(t), and f(t), as well as a lognormal PDF with the same mean and median as f(t) for comparison. Note that the probability that XIP+IOD is less than or equal to 33 days is 0.95.

Figure 1
figure 1

Probability density functions (PDF) of incubation period (IP, blue line) and illness onset to death period (IOD, red line) from Linton et al.12. Green line, f(t), is the convolution of both and represents the PDF of infection to death. Black dotted line is a lognormal PDF with the same mean and median as f(t).

Given a time series of deaths produced by the illness, x(t), we estimate the infection time series that produced such deaths, y(t). If we assume a case fatality ratio (CFR) of 100%, y(t) would represent all daily infections. To calculate y(t), we use the following likelihood-based estimation procedure.

Because the relative likelihood that a given infection will produce a death t days later is f(t), the infections at a given time t0 can be estimated as

$$y\left({t}_{0}\right)=\underset{{t}_{0}}{\overset{+\infty }{\int }}x\left(t\right)\cdot f\left(t-{t}_{0}\right) dt.$$

In practice, x(t) is a discrete time series and could be written as x(n), where n is an integer representing entire days. Let F(n) be a discrete approximation to f(t) as follows:

$$F\left(n\right)=\underset{n}{\overset{n+1}{\int }}f\left(t\right) dt.$$

Then,

$$y\left({n}_{0}\right)=\sum_{n={n}_{0}}^{+\infty }x\left(n\right)\cdot F\left(n-{n}_{0}\right).$$
(1)

A similar analysis was used previously to estimate a time-variable effective reproduction number for the severe acute respiratory syndrome (SARS) in 200313.

For a different CFR, the associated/estimated infections are estimated from y(n) in Eq. (1) as

$$\text{Inferred infections}\left(n\right)=\frac{y\left(n\right)}{CFR(n)}\times 100.$$
(2)

Equations (1) and (2) constitute the kernel of the REMEDID algorithm. Validation of the capacity of REMEDID to reproduce the number of new daily infections is in Supplementary Material. Note that any delay in death reporting will produce an underestimate of the infections. Then, interpretation of the last days of the time series must be done carefully.

Data

Official numbers of COVID-19 infections and deaths

COVID-19 daily infections and deaths for Spain are reported by the Centro de Coordinación de Alertas y Emergencias Sanitarias (CCAES), which depends on the Ministry of Health, Social Services and Equality. Time series for the 19 regions of Spain can be downloaded from the national organization that compiles and publishes the data from all regions, the Instituto de Salud Carlos III (ISCIII; https://covid19.isciii.es/). Time series for Spain were estimated by adding the time series of the 19 regions. The final access to infections from official data was on 16 February 2021 (hereafter referred as IO21). Those data have been greatly improved with respect to the daily situation reports published by CCAES during the first wave in the first semester of 2020 (hereafter referred as IO20; https://www.mscbs.gob.es/profesionales/saludPublica/ccayes/alertasActual/nCov-China/situacionActual.htm). For example, IO20 cases were given the date of diagnosis, whereas IO21 cases were given the date at symptoms onset. When the date of symptoms onset was not available, it was estimated as follows: infections prior to 10 May 2020 were given the diagnosis date minus 6 days, whereas only 3 days were subtracted from the date of diagnosis of infections after 11 May 2020. Asymptomatic cases were given the diagnosis date. These corrections brought reported cases closer to the real infection dates, although they were still delayed by the length of the incubation period. In addition, re-evaluation of the suspicious cases brought out new infections. As a consequence, the first case in IO21 is set now on 1 January 2020, while it was on 20 February 2020 in IO20. The first official death was reported on 13 February 2020, although it was an isolated case. Deaths reported in consecutive days started on 3 March 2020. Then, time spans we considered for IO21 is from 1 January 2020 to 29 November 2020 and for official daily deaths from 3 March 2020 to 1 January 2021, specifically with 33 extra days as required to apply the REMEDID methodology.

When applying REMEDID the use of a reliable CFR is essential. Thus, to determine a realistic CFR in Spain, we used the total infections data from the longitudinal seroprevalence study conducted by ISCIII in 4 phases. The first three phases were completed from 27 April to 22 June and the results were published on 6 July 20204,5. In this longitudinal cohort study, ~ 68,000 people were randomly selected and ~ 55,000 of them agreed to be interviewed and tested in the three phases. The study found that 5.2% (CI 95% 4.9–5.5%) of the participants presented SARS-Cov2 IgG antibodies. Extrapolation of this result to the ~ 47 million of the population in Spain yielded 2.45 million infections until 22 June, which, with 29,674 accumulated COVID-19 deaths on 22 June, would give a CFR of 1.21% (CI 95% 1.15–1.29%). We assumed this value in our analysis for Spain until 22 June. The fourth phase of the study was completed from 16 to 29 November 2020 and the results published on 15 December 202014. In that study, ~ 51,000 people agreed to participate. The study found that 9.9% (CI 95% 9.4–10.4%) of the participants presented SARS-Cov2 IgG antibodies at any of the four phases. Then, there were 4.75 millions of accumulated infections in Spain, meaning 2.30 million new infections from 23 June to 29 November. Because there were 47,113 accumulated COVID-19 deaths on 29 November (17,439 new deaths from 23 June), it would give a CFR of 0.76% (CI 95% 0.71–0.81%). Thus, we assumed different CFRs for the different periods, in order to smooth the discontinuity effects, we propose a CFR time series segmentation as follows: (i) a constant value of 1.21% from 1 January to 22 June 2020, (ii) a constant value of 0.76% from 29 June to 29 November 2020, and (iii) a linear function from 23 to 28 June that would correspond to the interpolation of the CFR values for 22 June and 29 June. The CFRs of each region were estimated analogously (summarized in Tables 1, 2). Prior to 22 June, Asturias, Castilla-La Mancha, Castilla y León, Extremadura, País Vasco, and La Rioja had CFRs greater than 1.5%. In contrast, Andalucía, Canarias, Melilla, and Murcia had CFRs less than 0.7%. After 22 June, the CFR for Spain diminished in 0.45%, probably due to improved medical treatments and experience gained during the first wave. In this period, the number of regions with a CFR below 0.7% increased to 10 and those with a CFR greater than 1.5% decreased to 3. Most of the regions revealed certain improvement, particularly remarkable was La Rioja, with a CFR decrease from 3.13 to 1.12%.

Table 1 Accumulated COVID-19 deaths and MoMo excess deaths (ED) for Spain and its 19 regions.
Table 2 Accumulated COVID-19 deaths and MoMo excess deaths (ED) for Spain and its 19 regions.

Excess of expected deaths from MoMo

Expected deaths from any cause, with an error band representing a 99% confidence interval, are modelled by the Mortality Monitoring (MoMo) surveillance system for 22 countries in Europe (www.euromomo.eu). In Spain, MoMo depends on the ISCIII15 and data are available at https://momo.isciii.es/public/momo/dashboard/momo_dashboard.html#nacional. MoMo also collects daily deaths from any cause from the Spanish Instituto Nacional de Estadística (INE) and the notaries and civil registries. Daily deaths within the error estimates of the expected deaths are considered usual. Any deviation allows identification of unusual mortality patterns, as for the COVID-19 pandemic. The useful variable here is the excess deaths (ED), estimated as the difference between actual and expected deaths. The error band for ED can be estimated from the error band of the expected deaths by subtracting the latter from the actual deaths. Note that ED can show negative values (Fig. 2). By cautiously assuming that all significant ED is caused by COVID-19, ED can be used to estimate non-reported COVID-19 deaths. In this case, negative values for ED make no sense and are set to zero. The MoMo time series was downloaded on 16 February 2021 and we considered the period from 3 March 2020 (first non-isolated official COVID-19 death) to 1 January 2021.

Figure 2
figure 2

COVID-19 deaths and MoMo Excess Deaths (ED). (a) Spain; (bt) Spanish regions. Red lines are official COVID-19 deaths, solid black lines are MoMo deaths, and thin black lines delimit MoMo death error bands (99% confidence interval).

Given that MoMo ED in Spain was 45,119 (CI 99% 37,593–55,098) deaths from 3 March to 22 June, and 23,612 (CI 99% 11,356–39,690) from 23 June to 29 November, the CFR for the country estimated similarly as in “Official numbers of COVID-19 infections and deaths” section was 1.85% (CI 95% 1.74–1.96%) prior to 22 June, and 1.02% (CI 95% 0.97–1.09%) after 23 June. For each region, the CFRs were estimated similarly and are provided in Tables 1 and 2. In all cases, we used a piece wise CFR time series segmented as described in “Official numbers of COVID-19 infections and deaths” section.

Results

COVID-19 vs. MoMo

The COVID-19 official deaths and MoMo ED time series overlaped for the period from 3 March 2020 to 1 January 2021 for Spain and its 19 regions (Fig. 2). In general, there was good agreement between both datasets, meaning that most of MoMo ED were related to COVID-19 deaths. During the first wave, the most important differences were observed in Spain, Madrid, Cataluña, Castilla-La Mancha, and Castilla y León. Before 22 June in Spain, MoMo ED showed 15,445 accumulated deaths more than the official COVID-19 deaths, which is beyond the error band. That difference comes basically from the four regions with the largest numbers of deaths (Madrid, Cataluña, Castilla-La Mancha, and Castilla y León). Table 1 shows the accumulated values before 22 June, which were used to estimate the CFR for Spain and its 19 regions according to the third phase of the National Seroprevalence Study4,5. For all regions, the CFR estimated from MoMo ED was larger than the CFR estimated from COVID-19 deaths. In particular, Asturias, Canarias, and Murcia were twice as large. Ceuta and Melilla dramatically increased their CFR from MoMo ED, although that may be biased due to their small populations and numbers of deaths.

Similarly, the same variables for the period from 23 June to 29 November 2020 are reported in Table 2. In Spain, MoMo ED showed 6173 accumulated deaths more than the official COVID-19 deaths. This difference is a third of the difference observed prior to 22 June; because this is within the error band, there was a significant improvement in the detection of COVID-19 deaths in this period. Figure 2 also shows a general agreement between MoMo ED and official COVID-19 deaths time series after the first wave, with the exception of late July and early August. These differences were due to two heat waves that were responsible for at least 25% of the MoMo ED16.

Infections estimated from COVID-19 deaths

To illustrate the delay between official daily infections data and REMEDID estimated daily infections, we applied REMEDID from COVID-19 deaths assuming CFR = 100%. Figure 3 shows the current IO21 and the infections associated with COVID-19 deaths for the first wave. The latter in Spain reached a maximum on 13 March 2020 (Table 3), the day before the national government decreed a state of emergency and national lockdown. Thus, the adopted measures had an immediate effect, which was observed in the official data IO21 7 days later (20 March). This delay is similar to the incubation period (mean 5.78 days2), which could be explained because official infections were reported when symptoms appeared. This delay reached 16 days when we compared with earlier version IO20 (not shown), which highlights the usefulness of the methodology to reinterpret official data from very early stages of the pandemic. On the other hand, the maximum number of deaths was reached on 1 April, which was 19 days after the inferred infection maximum, bringing this delay close to the 20 days expected between infection and death (Figs. 1, 3).

Figure 3
figure 3

Official COVID-19 infections and deaths, and estimated infections with case fatality ratio (CFR) of 100% in Spain during the first wave. Left y-axis: COVID-19 daily infections IO21 (blue curve). Right y-axis: COVID-19 deaths (orange curve) and its REMEDID-estimated infections with CFR = 100% (red curve). All curves are for Spain. Thin blue and orange curves are daily data, and thick curves are smoothed by 14-day running mean. Arrows show delays between the maximum of inferred infections and maxima from COVID-19 deaths (orange arrows) and COVID-19 infections (blue arrows). Solid arrows are expected delays, dotted while arrows are observed delays.

Table 3 Date of first infection for REMEDID estimated daily infections from COVID-19 deaths (IRO) and from MoMo Excess Deaths (ED) (IRM), and for official COVID-19 daily infections released on June 2020 (IO20) and on February 2021 (IO21).

We applied REMEDID to the official COVID-19 deaths with the corresponding estimated CFRs (see “Data” section) to obtain the time series of estimated daily infections, hereafter referred to as IRO. Figure 4 shows IRO and the accumulated infections for Spain and its 19 regions. Note that in Spain, IRO are amplified versions of inferred infections in Fig. 3. In Spain, the first infection, according to IRO,, is on 8 January 2020 (Table 3), 43 days before the first infection was officially reported on 20 February 2020 according to IO20. By contrast, IO21 places the first infection on 1 January 2020. Spain reached the maximum number of IRO on 13 March, a day before the state of emergency and lockdown were enforced (Table 4). On 14 March, IO20 = 1832, and IO21 = 7478; however, IRO = 63,727 (CI 95% 60,050–67,403), 35 and 9 times IO20 and IO21, respectively (Table 5). This implies that on that day, IO20 and IO21 only reported 2.9% (CI 95% 2.7–3.1%) and 11.7% (CI 95% 11.1–12.5%) of new infections, respectively. Although detection of infections clearly improved from IO20 to IO21, almost 90% of the infections are still not documented in the peak of the first wave. The situation is similar for the accumulated infections before 22 June 2020, as reported by the National Seroprevalence Study4,5.

Figure 4
figure 4

Daily and accumulated infections for official COVID-19 daily infections (IO21), and daily infections estimated from COVID-19 deaths (IRO). Lines are daily infections and refer to the y-axis on the right; bars are accumulated infections and refer to the y-axis on the left. Red lines and cyan bars are official COVID-19 data; orange lines and blue bars are inferred infections with case fatality ratio (CFR) in Table 1. Thin orange lines correspond to the CFR confidence interval.

Table 4 Date of the most prominent relative maxima, for Spain and the 19 regions, of the REMEDID estimated daily infections from COVID-19 deaths (IRO) and from MoMo excess deaths (ED) (IRM), and official COVID-19 daily infections (IO21).
Table 5 REMEDID estimated daily infections from COVID-19 deaths (IRO) and from MoMo excess deaths (ED) (IRM) on 14 March, and for official COVID-19 daily infections released on June 2020 (IO20) and released on February 2021 (IO21).

In almost all regions, IO20 showed a delay of 1 month or more between the first infection and IRO (Table 3). No delay in IO21 occurred in Islas Baleares, Castilla-León, and Galicia, while in three regions (Cataluña, Madrid, and La Rioja), the first case occurred earlier than the first case of IRO. However, 6 regions had delays of 15 days and other 6 regions had delays of 1 month. According to IRO, all regions except Ceuta and Melilla had some infections in January, but in IO21 only 6 regions had infections in that month. In all scenarios, the first infections were in Madrid and Cataluña.

During the first wave, according to IRO most of the regions had maximum daily infections around 14 March. In Madrid, the maximum was reached on 11 March, coinciding with the educational centres closing and an official warning by the regional government (Table 4). Asturias was the last region to reach peak infections (25–26 March). The maximum percentage of documented cases (12.6%, CI 95% 9.2–18.4%) occurred in Asturias on 14 March, but in the other regions, only between 1.2 and 8% of the infections were documented (Table 5).

Figure 4 shows how the IO21 and IRO curves of Spain and the 19 different regions fluctuated following the same pattern until the middle of June 2020, but thereafter, they showed different patterns. This reflects the fact that the Spanish government had decreed the control measures for the whole nation until June, but thereafter, each regional government implemented its own control measures. For example, some regions (e.g., Aragón, Islas Baleares, Cantabria, Comunidad Valenciana, Extremadura, Galicia, Murcia, País Vasco, and La Rioja) had two peaks, but others had only one. An apparent maximum on 22 June in Islas Baleares is an artifact produced by the interpolation for transition from the two CFRs. Although beyond the scope of this work, it would be very interesting to investigate the effects of the different control measures implemented on the corresponding IRO for the 19 regions.

The Spanish COVID-19s wave reached a maximum of daily infections on 22 October from IRO and on 26 October from IO21. The delay of 4 days is similar to the mean incubation period (5.78 days2). The estimated number of new infections is still larger than the documented cases, but the shapes of the two curves are more similar in the second wave than in the first wave (Fig. S1). The same is true for the 19 regions, most of which had the largest peak around 22–26 October, with the exceptions of Canarias and Madrid, which reached maxima in late August and early September, respectively.

Infections from MoMo excess deaths

Assuming that MoMo ED accounts for both recorded and non-recorded COVID-19 deaths, negative deaths are meaningless, and they were set to zero. Then, the associated daily infections can be estimated, as in “Infections estimated from COVID-19 deaths” section, with a CFR of 100% from MoMo ED for Spain (Fig. 5). Note two main differences between this time series and that estimated from official COVID-19 deaths: (1) MoMo data present an error band that was inherited by the estimated infections; (2) MoMo ED estimated infections reached a maximum of 1443 (CI 99% 1329–1547), doubling the 776 inferred daily infections from official COVID-19 deaths in Fig. 3. This is because maximum MoMo ED was 1,584 (CI 99% 1468–1686) and maximum COVID-19 official deaths was 828, both estimated from the 14-day running mean time series. The maximum of inferred infections was reached on 13 March, just one day prior to the state of emergency and lockdown. The expected and observed delays with respect to official infections and MoMo ED were similar to those observed for estimated infections from official COVID-19 deaths. Error bounds of the estimated infections in Fig. 5 were computed from the MoMo ED error bounds. However, it should be highlighted that the combination of the error bounds from MoMo ED and the estimated CFRs might lead to unrealistic error estimates. To avoid this, the error estimates in Fig. 6 were estimated from the MoMo ED time series (no error bounds) and the error bounds of the estimated CFRs.

Figure 5
figure 5

Official COVID-19 infections, MoMo Excess Deaths (ED), and estimated infections with case fatality ratio (CFR) of 100% in Spain during the first wave. Left y-axis: COVID-19 daily infections IO21 (blue curve). Right y-axis: MoMo ED (orange curve) and its REMEDID-estimated infections with CFR = 100% (red curve). All curves are for Spain. Thin blue and orange curves are daily data, and thick curves are smoothed by 14-days running mean. Dashed curves represent the error estimate of MoMo ED (orange) and inferred infections (red). Arrows show delays between the maximum of inferred infections and maxima from MoMo ED (orange arrows) and COVID-19 infections (blue arrows). Solid arrows are expected delays, dotted while arrows are observed delays.

Figure 6
figure 6

Daily and accumulated infections for official COVID-19 daily infections (IO21), and daily infections estimated from MoMo Excess Deaths (ED) (IRM). Lines are daily infections and refer to the y-axis on the right; and bars are accumulated infections and refer to the y-axis on the left. Red lines and cyan bars are for official COVID-19 data; and orange lines and blue bars are for inferred infections with case fatality ratio (CFR) in Table 1. Thin orange lines represent the error estimate of inferred infections.

The REMEDID was applied to the MoMo ED with the corresponding CFRs (see “Data” section) to obtain the estimated daily infections, which will be referred hereafter as IRM. The IRM were calculated for Spain and its 19 regions and are depicted in Fig. 6, as well as the accumulated IRM. In Spain, the first infection shown by IRM happened on 9 January, with an error estimate from 9 to 10 January, 41 to 42 days before the first documented infection of IO20 on 20 February 2020 (Table 3). The maximum IRM was 77,855 (CI 95% 73,364–82,347) reached on 13 March. On 14 March, IRM showed 14,128 infections more than IRO (Table 5). Notice that the CFR used with MoMo ED data was larger than the one used with official COVID-19 deaths data, which makes this difference even more remarkable, because the larger the CFR the lower the estimated infections. Therefore, if the true CFRs, which are unknown, were used in both cases, IRM would double IRO on 14 March, as happened when a CFR of 100% was used (Figs. 3, 5). Notice that with the CFRs used, the IRM and IRO resulted in the same accumulated infections on 22 June and 29 November, matching the results of the seroprevalence study. Nevertheless, IRM showed 42 times more cases than IO20 and 10 times more than IO21 on 14 March, detection of official cases of only 2.4% (2.2–2.5%) and 9.6% (9.1–10.2%), respectively.

Table 3 shows the estimated date of first infection for Spain and by region. Note that the first cases of IRM in Spain were on 9 January and in Aragón, Canarias, and Navarra on 8 January, which is possible because significant excess deaths in a region may not become significant for the whole country. In general, the maxima of daily infections were closer to those on 14 March in IRM than in IRO. During the first wave, all regions showed a single maximum, except for Ceuta, Melilla, and Murcia, which showed two maxima (Fig. 6). In general, the IRM time series in all regions were similar during that period. The official data clearly under-detected infections during the first wave. On 14 March, IRM were comparable to IRO, overlapping CI in all regions, but not in Spain as a whole (Table 5). During the second wave, there was improved detection of cases with differences among regions.

Discussion

Infection dynamics is key to understand and model the COVID-19 pandemic. Nevertheless, documented infections in Spain (and nearly worldwide) are of limited quality both qualitatively and quantitatively. Specifically, reported infections were underestimated and delayed 16 days compared to the estimated date of infection during the first months of the pandemic (according to IO20) and 6 days in the current version of the data (IO21). According to the National Seroprevalence Study, only 10.7% of infections were documented in Spain prior to 22 June 20204,5 and 64.3% from 23 June to 29 November 202014. The huge underreporting during the first wave is mainly due to: (1) the lack of testing for asymptomatic and mildly symptomatic people3, which could have been detected with a “test, track and trace” strategy, as done in South Korea, China, and Singapore17; (2) deaths outside of the hospitals, with many cases from nursing homes for the elderly, where 6664 deaths with COVID-19 and 7082 with similar symptoms, but not officially diagnosed with COVID-19, occurred from January to May 202018. This poor quality of data hinders any options to study the infection dynamics based on official data.

To overcome these difficulties, we propose the REMEDID algorithm, which allows calculation of time series for daily infections from daily death time series, a more reliable source, which can be applied in early stages of any COVID-19 outbreak with only 33 days delay. The REMEDID algorithm needs only three inputs:

  1. (1)

    PDF of the period from date of infection to date of death. In this study, that was calculated by combining the PDF of the Incubation Period and the Illness Onset to Death period, as estimated by Linton et al.11 for cases in Wuhan, China. The incubation period can be assumed to be similar in China and elsewhere, because it is a characteristic of the virus, but the illness onset to death period could be affected by the type of health care and, thus, by each national health system2 or even ethnicity19. No such information is available for Spain to date, but it would be desirable to use a local distribution of Illness Onset to Death when available for Spain. The calculations could be redone easily by applying the MATLAB code available in GitHub (https://github.com/isavig/REMEDID).

  2. (2)

    Time series of daily deaths. We used two sources, the official COVID-19 deaths and the MoMo ED, because official COVID-19 deaths have been underestimated, especially during the first wave in the regions of Castilla-La Mancha, Castilla y León, Cataluña, and Madrid. Such underestimation was due to the fact that non-hospital deaths and cases during the early pandemic were somewhat undertested (Table 1 , Fig. 2 ). We expect that undocumented COVID-19 deaths appeared in the excess of expected deaths from any cause calculated from MoMo15. Interpretation of those results must be made carefully because ED is a statistical variable calculated from what happened in previous years and some expected deaths, such as those produced by traffic and work accidents, may have been reduced due to the national lockdown that began on 14 March 2020 and gradually has been eased. Nonetheless, expected fatalities were very low (traffic and work accidents less than 5 deaths daily 20,21) compared to the daily death toll fromCOVID-19 (300 deaths per day average from March to May). Some non-COVID-19 deaths may have been influenced by other pandemic-related factors22, such as breakdown in medical follow-up (especially chronic conditions and among older patients), delays in attending to the healthcare appointments and/or being attended due to the collapse of the Health System, transplant delay and reduction in donors’ numbers from 7.2 to 1.2 per day23, and the decrease of the interventional cardiology activity between 40 and 81%24. Thus, MoMo ED do not exactly correspond to COVID-19 deaths and must be regarded with caution. A detailed study of the changes in other sources of mortality is recommended. Nevertheless, MoMo ED is a valuable source of data that must be explored considering the enormous bias in official data of daily infections.

  3. (3)

    CFR estimate. CFR was defined as the ratio of deaths among total infections. Thus, the lower the documented infections, the larger the CFR and under-detection of infections leads to overestimation of the CFR. For example, in Spain on 22 June 2020, official data showed 261,111 accumulated infections and 29,676 deaths, a CFR of 11.37%, which was clearly overestimated because the number of infections was underestimated. The National Seroprevalence Study in Spain with random testing showed that 5.2% of the population (~ 47 million) had already been infected4,5. Because CFR also depends on death records, we calculated a CFR of 1.21% (CI 95% 1.15–1.29%) from official COVID-19 deaths and a CFR of 1.85% (CI 95% 1.74–1.96%) from MoMo ED. However, the CFR is not constant in time nor in space. Different regions showed different CFRs and the fourth phase of the seroprevalence study showed a general decrease in the CFRs for the period 23 June–29 November. In Spain, the CFR decreased from 1.21% to 0.76% (CI 95% 0.71–0.81%) from official COVID-19 deaths and from 1.85% to 1.02% (CI 95% 0.97–1.09%) from MoMo ED. This decline is probably due to the improvement of COVID-19 treatments and experience gained since the beginning of the pandemic25. Similar seroprevalence studies are desirable in other countries, because CFRs from official data are so variable. For example, on 22 June 2020, CFRs of 0.6, 5.2, and 14.5% were reported for Iceland, the world, and Italy, respectively26. In addition, we assumed a constant CFR prior to 22 June and from 23 June to 29 November 2020, although introducing time dependence on the CFR would enhance the infected estimates. A constant CFR does not consider who is affected by saturation of the health systems due to limited human and material resources. However, to estimate a time-variable CFR requires accurate infection records among other variables, which are not currently available. Moreover, CFRs among severe cases (mostly detected) are much greater than those among asymptomatic and mild cases (mostly undetected). Consequently, the CFR calculation has a high degree of uncertainty. Nevertheless, because information on the CFRs is expected to improve, calculations in this study can be updated easily using the MATLAB code for REMEDID algorithm available in GitHub (https://github.com/isavig/REMEDID). Supplementary Material

We ran the REMEDID algorithm to provide the estimated time series of infections for Spain and for its 19 regions for the period from 8 January to 29 November 2020. These time series provided valuable information to understand the time evolution of the pandemic. Our main findings for Spain are:

  1. (1)

    On 14 March, when the national government decreed the state of emergency and national lockdown, estimated infections were between 35 and 42 times larger than officially documented ones at that time, and 9–10 times more than current official data.

  2. (2)

    The national lockdown had a strong effect on the transmission of the virus, as shown by a rapid slope decrease around day 13 March.

  3. (3)

    The first infection was better determined from inferred infections than from official records during the first months of the pandemic.

The REMEDID algorithm has strengths and limitations. First, it uses the number of deaths, which are more accurately recorded than infections. Second, it allows elucidation of the date of the infections estimated during the seroprevalence studies, which only determines the total number of infections. Thus, the REMEDID algorithm complements the seroprevalence studies. Third, the estimated daily infections place the probable date of infection more accurately than the official numbers. Note that official infections are delayed by the incubation period (from infection to illness onset). Therefore, although the maxima of infections theoretically should have coincided with the state of emergency and the national lockdown on 14 March, official infections in Spain were maximum 6 days later. Infections estimated from death data do not show such delay, however. Determination of the actual day of infection is very important for evaluation of the immediate effect of the implemented countermeasures. In Madrid the estimated infections reached their maximum on 11 March when regional government warned the population to stay at home and schools and universities were closed, forcing 1.2 million students to stay at home. Moreover, overall recommendations on disease control and social distancing were given by the Ministry of Health to the general public since the beginning of March (see https://www.mscbs.gob.es/en/profesionales/saludPublica/ccayes/alertasActual/nCov-China/ciudadania.htm).

Four, the REMEDID algorithm can be applied to any region, regardless of population size, but being cautious with small populations. For example, in this study we applied REMEDID methodology to a small city (Ceuta with 84,777 inhabitants), a medium size region (Madrid with more than 6.6 million inhabitants), and a country (Spain with 47 million inhabitants). This versatility allows study of different epidemic dynamics as reflected in the variety of shapes and slopes in the curves depicting death and infection records, that show the heterogeneity of the epidemic at each region with different population size and density, social behaviour and transmission pattern, especially during the second wave, showing unique dynamics that must be addressed individually. Knowledge of the real epidemic dynamics of different population nodes (regions, cities, neighbourhoods, etc.) is key to succeed in modelling attempts, because we could calculate the sole effect of group size27. This spatially-explicit information, in combination with population size per node and mobility would allow us to use a metapopulation approach in future models28,29,30.

Among the methodology limitations, the most important is that it can only be implemented retrospectively. Thus, these estimates cannot be used to control the pandemic in real time. But considering that different regions or countries are at different stages at the same time, results of this methodology for the first communities could be applied elsewhere. Furthermore, models are valuable tools to improve our knowledge of the pandemic dynamics, including the effectiveness of the different measures adopted to flatten the curve and design safe post-lockdown measures31,32,33,34. In addition, more realistic and accurate models could be obtained by use of the more realistic daily number of infections provided by REMEDID, which, in turn, would improve the model outputs and enhance comparisons of different lockdown and post-lockdown measures.

We believe that REMEDID methodology applied to daily COVID-19 deaths (if accurately reported) or to MoMo ED could be useful to analyse the dynamics of the pandemic retrospectively and more accurately quantify the real daily infections with respect to the official numbers. We only need the CFR and, for greater precision, the PDF of the period from Infections to Death or, alternatively, the PDF of Incubation Period and Illness Onset to Death. This approach could be implemented anywhere, improving our knowledge and understanding of the dynamics of the pandemic and the effectiveness of the confinement measures. This is of high importance to develop successful strategies to face and reduce the effects of future epidemic episodes.