Retrospective methodology to estimate daily infections from deaths (REMEDID) in COVID-19: the Spain case study

The number of new daily infections is one of the main parameters to understand the dynamics of an epidemic. During the COVID-19 pandemic in 2020, however, such information has been underestimated. Here, we propose a retrospective methodology to estimate daily infections from daily deaths, because those are usually more accurately documented. Given the incubation period, the time from illness onset to death, and the case fatality ratio, the date of death can be estimated from the date of infection. We apply this idea conversely to estimate infections from deaths. This methodology is applied to Spain and its 19 administrative regions. Our results showed that probable daily infections during the first wave were between 35 and 42 times more than those officially documented on 14 March, when the national government decreed a national lockdown and 9 times more than those documented by the updated version of the official data. The national lockdown had a strong effect on the growth rate of virus transmission, which began to decrease immediately. Finally, the first inferred infection in Spain is about 43 days before the official data were available during the first wave. The current official data show delays of 15–30 days in the first infection relative to the inferred infections in 63% of the regions. In summary, we propose a methodology that allows reinterpretation of official daily infections, improving data accuracy in infection magnitude and dates because it assimilates valuable information from the National Seroprevalence Studies.

The key parameter to understand and model the evolution of the COVID-19 pandemic is the number of daily infections. Despite its importance, however, it is difficult to estimate reliable data due to the bias in official figures 1 . For symptomatic patients, there are delays of 5.79 days from infection to symptom onset and 5.82 days from symptom onset to diagnosis 2 . In addition, the presence of asymptomatic people and those with mild symptoms hinder their detection. For example, 86% of infections were undetected in Wuhan, China, prior to 23 January 2020, when travel restrictions were imposed 3 . Additionally, not all patients with COVID-19 compatible symptoms are tested, especially at the beginning of the pandemic. So, it is widely assumed that the reported infections are just a fraction of the actual ones. In Spain, a National Seroprevalence Study based on ~ 55,000 random participants estimated that only 10.7% (CI 95% 10.1%-11.3%) of the actual infections had been detected during the first wave (before 22 June 2020) 4,5 .
Another problem is the inconsistency of daily infection time series and the variability of the case definitions. For example, on 13 February 2020, China changed the parameters to confirm new cases and ~ 15,000 new infections were counted in a single day, while daily infections during the previous week were less than 3500 6 . In Spain, the method of counting official numbers of infections has been modified several times since the pandemic was decreed. On 22 April 2020, a total of 213,024 cases were reported from positive PCR and antibody tests 7 ; however, total cases decreased to 202,990 the next day and thereafter, because only positive PCRs were included in the total cases 8 . Accumulated cases on a given day are obtained by adding the new cases that day to the accumulated cases of the previous day. That has not always been the method, however, because some Spanish regions reviewed and modified the cases in previous days 9, 10 . This inconsistency of time series hampers any kind of accurate analysis.

REMEDID.
We named this methodology REMEDID, REtrospective Methodology to Estimate Daily Infections from Deaths and it is described as follows. Date of infection (DI) of an individual may be estimated from the death date (DD) by subtracting the incubation period (IP) and illness onset to death (IOD) periods, hence Therefore, DI can be estimated from DD as far as IP and DD are known; however, neither IP nor IOD are fixed values. On the contrary, they are random variables that can be approximated by probability distributions. From dozens of cases in Wuhan, Linton et al. 11 approximated IP for COVID-19 with a lognormal distribution, X IP , with a mean of 5.6 days and median of 5 days, and IOD with a lognormal distribution, X IOD , with a mean of 14.5 days and median of 13.2 days. Therefore, the infection to death period is a random variable, X IP+IOD , that follows the distribution X IP + X IOD .
Although the addition of two lognormal distributions does not follow any commonly used probability distribution, its probability density function (PDF) can be estimated convolving the PDF of the two variables 12 . If g(t) and h(t) are the PDF of IP and IOD, respectively, then their convolution defines the PDF of X IP+IOD , where t 0 is a positive real number representing days from infection. Then, f(t) is the PDF of the time from infection to death, with a mean of 20.1 days and a median of 18.8 days. Figure 1 shows g(t), h(t), and f(t), as well as a lognormal PDF with the same mean and median as f(t) for comparison. Note that the probability that X IP+IOD is less than or equal to 33 days is 0.95.
Given a time series of deaths produced by the illness, x(t), we estimate the infection time series that produced such deaths, y(t). If we assume a case fatality ratio (CFR) of 100%, y(t) would represent all daily infections. To calculate y(t), we use the following likelihood-based estimation procedure.  12 . Green line, f(t), is the convolution of both and represents the PDF of infection to death. Black dotted line is a lognormal PDF with the same mean and median as f(t). www.nature.com/scientificreports/ Because the relative likelihood that a given infection will produce a death t days later is f(t), the infections at a given time t 0 can be estimated as In practice, x(t) is a discrete time series and could be written as x(n), where n is an integer representing entire days. Let F(n) be a discrete approximation to f(t) as follows: Then, A similar analysis was used previously to estimate a time-variable effective reproduction number for the severe acute respiratory syndrome (SARS) in 2003 13 .
For a different CFR, the associated/estimated infections are estimated from y(n) in Eq. (1) as Equations (1) and (2) constitute the kernel of the REMEDID algorithm. Validation of the capacity of REME-DID to reproduce the number of new daily infections is in Supplementary Material. Note that any delay in death reporting will produce an underestimate of the infections. Then, interpretation of the last days of the time series must be done carefully.

Data. Official numbers of COVID-19 infections and deaths. COVID-19 daily infections and deaths for Spain
are reported by the Centro de Coordinación de Alertas y Emergencias Sanitarias (CCAES), which depends on the Ministry of Health, Social Services and Equality. Time series for the 19 regions of Spain can be downloaded from the national organization that compiles and publishes the data from all regions, the Instituto de Salud Carlos III (ISCIII; https:// covid 19. isciii. es/). Time series for Spain were estimated by adding the time series of the 19 regions. The final access to infections from official data was on 16 February 2021 (hereafter referred as IO 21 ). Those data have been greatly improved with respect to the daily situation reports published by CCAES during the first wave in the first semester of 2020 (hereafter referred as IO 20 ; https:// www. mscbs. gob. es/ profe siona les/ salud Publi ca/ ccayes/ alert asAct ual/ nCov-China/ situa cionA ctual. htm). For example, IO 20 cases were given the date of diagnosis, whereas IO 21 cases were given the date at symptoms onset. When the date of symptoms onset was not available, it was estimated as follows: infections prior to 10 May 2020 were given the diagnosis date minus 6 days, whereas only 3 days were subtracted from the date of diagnosis of infections after 11 May 2020. Asymptomatic cases were given the diagnosis date. These corrections brought reported cases closer to the real infection dates, although they were still delayed by the length of the incubation period. In addition, re-evaluation of the suspicious cases brought out new infections. As a consequence, the first case in IO 21 is set now on 1 January 2020, while it was on 20 February 2020 in IO 20 . The first official death was reported on 13 February 2020, although it was an isolated case. Deaths reported in consecutive days started on 3 March 2020. Then, time spans we considered for IO 21 is from 1 January 2020 to 29 November 2020 and for official daily deaths from 3 March 2020 to 1 January 2021, specifically with 33 extra days as required to apply the REMEDID methodology.
When applying REMEDID the use of a reliable CFR is essential. Thus, to determine a realistic CFR in Spain, we used the total infections data from the longitudinal seroprevalence study conducted by ISCIII in 4 phases. The first three phases were completed from 27 April to 22 June and the results were published on 6 July 2020 4,5 . In this longitudinal cohort study, ~ 68,000 people were randomly selected and ~ 55,000 of them agreed to be interviewed and tested in the three phases. The study found that 5.2% (CI 95% 4.9-5.5%) of the participants presented SARS-Cov2 IgG antibodies. Extrapolation of this result to the ~ 47 million of the population in Spain yielded 2.45 million infections until 22 June, which, with 29,674 accumulated COVID-19 deaths on 22 June, would give a CFR of 1.21% (CI 95% 1.15-1.29%). We assumed this value in our analysis for Spain until 22 June. The fourth phase of the study was completed from 16 to 29 November 2020 and the results published on 15 December 2020 14 . In that study, ~ 51,000 people agreed to participate. The study found that 9.9% (CI 95% 9. (2) Inferred infections(n) = y(n) CFR(n) × 100.  www.nature.com/scientificreports/ than 1.5%. In contrast, Andalucía, Canarias, Melilla, and Murcia had CFRs less than 0.7%. After 22 June, the CFR for Spain diminished in 0.45%, probably due to improved medical treatments and experience gained during the first wave. In this period, the number of regions with a CFR below 0.7% increased to 10 and those with a CFR greater than 1.5% decreased to 3. Most of the regions revealed certain improvement, particularly remarkable was La Rioja, with a CFR decrease from 3.13 to 1.12%.
Excess of expected deaths from MoMo. Expected deaths from any cause, with an error band representing a 99% confidence interval, are modelled by the Mortality Monitoring (MoMo) surveillance system for 22 countries in Europe (www. eurom omo. eu). In Spain, MoMo depends on the ISCIII 15 and data are available at https:// momo. isciii. es/ public/ momo/ dashb oard/ momo_ dashb oard. html# nacio nal. MoMo also collects daily deaths from any cause from the Spanish Instituto Nacional de Estadística (INE) and the notaries and civil registries. Daily deaths within the error estimates of the expected deaths are considered usual. Any deviation allows identification of unusual mortality patterns, as for the COVID-19 pandemic. The useful variable here is the excess deaths (ED), estimated as the difference between actual and expected deaths. The error band for ED can be estimated from the error band of the expected deaths by subtracting the latter from the actual deaths. Note that ED can show negative values (Fig. 2 29 November, the CFR for the country estimated similarly as in "Official numbers of COVID-19 infections and deaths" section was 1.85% (CI 95% 1.74-1.96%) prior to 22 June, and 1.02% (CI 95% 0.97-1.09%) after 23 June. For each region, the CFRs were estimated similarly and are provided in Tables 1 and 2. In all cases, we used a piece wise CFR time series segmented as described in "Official numbers of COVID-19 infections and deaths" section.

Results
COVID-19 vs. MoMo. The COVID-19 official deaths and MoMo ED time series overlaped for the period from 3 March 2020 to 1 January 2021 for Spain and its 19 regions (Fig. 2). In general, there was good agreement between both datasets, meaning that most of MoMo ED were related to COVID-19 deaths. During the first wave, the most important differences were observed in Spain, Madrid, Cataluña, Castilla-La Mancha, and Castilla y León. Before 22 June in Spain, MoMo ED showed 15,445 accumulated deaths more than the official COVID-19 deaths, which is beyond the error band. That difference comes basically from the four regions with the largest numbers of deaths (Madrid, Cataluña, Castilla-La Mancha, and Castilla y León). Table 1 shows the accumulated values before 22 June, which were used to estimate the CFR for Spain and its 19 regions according to the third phase of the National Seroprevalence Study 4,5 . For all regions, the CFR estimated from MoMo ED was larger than the CFR estimated from COVID-19 deaths. In particular, Asturias, Canarias, and Murcia were twice as large. Ceuta and Melilla dramatically increased their CFR from MoMo ED, although that may be biased due to their small populations and numbers of deaths.
Similarly, the same variables for the period from 23 June to 29 November 2020 are reported in Table 2. In Spain, MoMo ED showed 6173 accumulated deaths more than the official COVID-19 deaths. This difference is a third of the difference observed prior to 22 June; because this is within the error band, there was a significant improvement in the detection of COVID-19 deaths in this period. Figure 2 also shows a general agreement between MoMo ED and official COVID-19 deaths time series after the first wave, with the exception of late July and early August. These differences were due to two heat waves that were responsible for at least 25% of the MoMo ED 16 .

Infections estimated from COVID-19 deaths.
To illustrate the delay between official daily infections data and REMEDID estimated daily infections, we applied REMEDID from COVID-19 deaths assuming CFR = 100%. Figure 3 shows the current IO 21 and the infections associated with COVID-19 deaths for the first wave. The latter in Spain reached a maximum on 13 March 2020 (Table 3), the day before the national government decreed a state of emergency and national lockdown. Thus, the adopted measures had an immediate effect, which was observed in the official data IO 21 7 days later (20 March). This delay is similar to the incubation period (mean 5.78 days 2 ), which could be explained because official infections were reported when symptoms appeared. This delay reached 16 days when we compared with earlier version IO 20 (not shown), which highlights the usefulness of the methodology to reinterpret official data from very early stages of the pandemic. On the other hand, the maximum number of deaths was reached on 1 April, which was 19 days after the inferred infection maximum, bringing this delay close to the 20 days expected between infection and death (Figs. 1, 3).
We applied REMEDID to the official COVID-19 deaths with the corresponding estimated CFRs (see "Data" section) to obtain the time series of estimated daily infections, hereafter referred to as IR O . Figure 4 shows IR O and the accumulated infections for Spain and its 19 regions. Note that in Spain, IR O are amplified versions of inferred infections in Fig. 3. In Spain, the first infection, according to IR O, , is on 8 January 2020 (Table 3), 43 days before the first infection was officially reported on 20 February 2020 according to IO 20  www.nature.com/scientificreports/   www.nature.com/scientificreports/  www.nature.com/scientificreports/ infections in January, but in IO 21 only 6 regions had infections in that month. In all scenarios, the first infections were in Madrid and Cataluña. During the first wave, according to IR O most of the regions had maximum daily infections around 14 March. In Madrid, the maximum was reached on 11 March, coinciding with the educational centres closing and an official warning by the regional government (Table 4). Asturias was the last region to reach peak infections (25)(26). The maximum percentage of documented cases (12.6%, CI 95% 9.2-18.4%) occurred in Asturias on 14 March, but in the other regions, only between 1.2 and 8% of the infections were documented (Table 5). Figure 4 shows how the IO 21 and IR O curves of Spain and the 19 different regions fluctuated following the same pattern until the middle of June 2020, but thereafter, they showed different patterns. This reflects the fact that the Spanish government had decreed the control measures for the whole nation until June, but thereafter, each regional government implemented its own control measures. For example, some regions (e.g., Aragón, Islas Baleares, Cantabria, Comunidad Valenciana, Extremadura, Galicia, Murcia, País Vasco, and La Rioja) had two peaks, but others had only one. An apparent maximum on 22 June in Islas Baleares is an artifact produced by the interpolation for transition from the two CFRs. Although beyond the scope of this work, it would be very interesting to investigate the effects of the different control measures implemented on the corresponding IR O for the 19 regions.
The Spanish COVID-19s wave reached a maximum of daily infections on 22 October from IR O and on 26 October from IO 21 . The delay of 4 days is similar to the mean incubation period (5.78 days 2 ). The estimated number of new infections is still larger than the documented cases, but the shapes of the two curves are more similar in the second wave than in the first wave (Fig. S1). The same is true for the 19 regions, most of which had the largest peak around 22-26 October, with the exceptions of Canarias and Madrid, which reached maxima in late August and early September, respectively.

Infections from MoMo excess deaths.
Assuming that MoMo ED accounts for both recorded and nonrecorded COVID-19 deaths, negative deaths are meaningless, and they were set to zero. Then, the associated daily infections can be estimated, as in "Infections estimated from COVID-19 deaths" section, with a CFR of 100% from MoMo ED for Spain (Fig. 5). Note two main differences between this time series and that estimated from official COVID-19 deaths: (1) MoMo data present an error band that was inherited by the estimated infections; (2) MoMo ED estimated infections reached a maximum of 1443 (CI 99% 1329-1547), doubling the 776 inferred daily infections from official COVID-19 deaths in Fig. 3. This is because maximum MoMo ED was 1,584 (CI 99% 1468-1686) and maximum COVID-19 official deaths was 828, both estimated from the 14-day running mean time series. The maximum of inferred infections was reached on 13 March, just one day prior to the state of emergency and lockdown. The expected and observed delays with respect to official infections and MoMo ED www.nature.com/scientificreports/ were similar to those observed for estimated infections from official COVID-19 deaths. Error bounds of the estimated infections in Fig. 5 were computed from the MoMo ED error bounds. However, it should be highlighted that the combination of the error bounds from MoMo ED and the estimated CFRs might lead to unrealistic error estimates. To avoid this, the error estimates in Fig. 6 were estimated from the MoMo ED time series (no error bounds) and the error bounds of the estimated CFRs. The REMEDID was applied to the MoMo ED with the corresponding CFRs (see "Data" section) to obtain the estimated daily infections, which will be referred hereafter as IR M . The IR M were calculated for Spain and its 19 regions and are depicted in Fig. 6, as well as the accumulated IR M . In Spain, the first infection shown by IR M happened on 9 January, with an error estimate from 9 to 10 January, 41 to 42 days before the first documented infection of IO 20 on 20 February 2020 (  (Table 5). Notice that the CFR used with MoMo ED data was larger than the one used with official COVID-19 deaths data, which makes this difference even more remarkable, because the larger the CFR the lower the estimated infections. Therefore, if the true CFRs, which are unknown, were used in both cases, IR M would double IR O on 14 March, as happened when a CFR of 100% was used (Figs. 3, 5). Notice that with the CFRs used, the IR M and IR O resulted in the same accumulated infections on 22 June and 29 November, matching the results of the seroprevalence study. Nevertheless, IR M showed 42 times more cases than IO 20 and 10 times more than IO 21 on 14 March, detection of official cases of only 2.4% (2.2-2.5%) and 9.6% (9.1-10.2%), respectively. Table 3 shows the estimated date of first infection for Spain and by region. Note that the first cases of IR M in Spain were on 9 January and in Aragón, Canarias, and Navarra on 8 January, which is possible because significant excess deaths in a region may not become significant for the whole country. In general, the maxima of daily infections were closer to those on 14 March in IR M than in IR O . During the first wave, all regions showed a single www.nature.com/scientificreports/ maximum, except for Ceuta, Melilla, and Murcia, which showed two maxima (Fig. 6). In general, the IR M time series in all regions were similar during that period. The official data clearly under-detected infections during the first wave. On 14 March, IR M were comparable to IR O , overlapping CI in all regions, but not in Spain as a whole (Table 5). During the second wave, there was improved detection of cases with differences among regions.

Discussion
Infection dynamics is key to understand and model the COVID-19 pandemic. Nevertheless, documented infections in Spain (and nearly worldwide) are of limited quality both qualitatively and quantitatively. Specifically, reported infections were underestimated and delayed 16 days compared to the estimated date of infection during the first months of the pandemic (according to IO 20 ) and 6 days in the current version of the data (IO 21 ). According to the National Seroprevalence Study, only 10.7% of infections were documented in Spain prior to 22 June 2020 4,5 and 64.3% from 23 June to 29 November 2020 14 . The huge underreporting during the first wave is mainly due to: (1) the lack of testing for asymptomatic and mildly symptomatic people 3 , which could have been detected with a "test, track and trace" strategy, as done in South Korea, China, and Singapore 17 ; (2) deaths outside of the hospitals, with many cases from nursing homes for the elderly, where 6664 deaths with COVID-19 and 7082 with similar symptoms, but not officially diagnosed with COVID-19, occurred from January to May 2020 18 . This poor quality of data hinders any options to study the infection dynamics based on official data. To overcome these difficulties, we propose the REMEDID algorithm, which allows calculation of time series for daily infections from daily death time series, a more reliable source, which can be applied in early stages of any COVID-19 outbreak with only 33 days delay. The REMEDID algorithm needs only three inputs: (1) PDF of the period from date of infection to date of death. In this study, that was calculated by combining the PDF of the Incubation Period and the Illness Onset to Death period, as estimated by Linton et al. 11 for cases in Wuhan, China. The incubation period can be assumed to be similar in China and elsewhere, because it is a characteristic of the virus, but the illness onset to death period could be affected by the type of health care and, thus, by each national health system 2 or even ethnicity 19 . No such information is available for Spain to date, but it would be desirable to use a local distribution of Illness Onset to Death when available for Spain. The calculations could be redone easily by applying the MATLAB code available in GitHub (https:// github. com/ isavig/ REMED ID). (2) Time series of daily deaths. We used two sources, the official COVID-19 deaths and the MoMo ED, because official COVID-19 deaths have been underestimated, especially during the first wave in the regions of Castilla-La Mancha, Castilla y León, Cataluña, and Madrid. Such underestimation was due to the fact that non-hospital deaths and cases during the early pandemic were somewhat undertested ( Table 1 , Fig. 2 ). We expect that undocumented COVID-19 deaths appeared in the excess of expected deaths from any cause calculated from MoMo 15 . Interpretation of those results must be made carefully because ED is a statistical variable calculated from what happened in previous years and some expected deaths, such as those produced by traffic and work accidents, may have been reduced due to the national lockdown that began on 14 March 2020 and gradually has been eased. Nonetheless, expected fatalities were very low (traffic and work accidents less than 5 deaths daily 20,21 ) compared to the daily death toll fromCOVID-19 (300 deaths per day average from March to May). Some non-COVID-19 deaths may have been influenced by other pandemic-related factors 22 , such as breakdown in medical follow-up (especially chronic conditions and among older patients), delays in attending to the healthcare appointments and/or being attended due to  We ran the REMEDID algorithm to provide the estimated time series of infections for Spain and for its 19 regions for the period from 8 January to 29 November 2020. These time series provided valuable information to understand the time evolution of the pandemic. Our main findings for Spain are: (1) On 14 March, when the national government decreed the state of emergency and national lockdown, estimated infections were between 35 and 42 times larger than officially documented ones at that time, and 9-10 times more than current official data. (2) The national lockdown had a strong effect on the transmission of the virus, as shown by a rapid slope decrease around day 13 March. (3) The first infection was better determined from inferred infections than from official records during the first months of the pandemic.
The REMEDID algorithm has strengths and limitations. First, it uses the number of deaths, which are more accurately recorded than infections. Second, it allows elucidation of the date of the infections estimated during the seroprevalence studies, which only determines the total number of infections. Thus, the REMEDID algorithm complements the seroprevalence studies. Third, the estimated daily infections place the probable date of infection more accurately than the official numbers. Note that official infections are delayed by the incubation period (from infection to illness onset). Therefore, although the maxima of infections theoretically should have coincided with the state of emergency and the national lockdown on 14 March, official infections in Spain were maximum 6 days later. Infections estimated from death data do not show such delay, however. Determination of the actual day of infection is very important for evaluation of the immediate effect of the implemented countermeasures. In Madrid the estimated infections reached their maximum on 11 March when regional government warned the population to stay at home and schools and universities were closed, forcing 1.2 million students to stay at home. Moreover, overall recommendations on disease control and social distancing were given by the Ministry of Health to the general public since the beginning of March (see https:// www. mscbs. gob. es/ en/ profe siona les/ salud Publi ca/ ccayes/ alert asAct ual/ nCov-China/ ciuda dania. htm).
Four, the REMEDID algorithm can be applied to any region, regardless of population size, but being cautious with small populations. For example, in this study we applied REMEDID methodology to a small city (Ceuta with 84,777 inhabitants), a medium size region (Madrid with more than 6.6 million inhabitants), and a country (Spain with 47 million inhabitants). This versatility allows study of different epidemic dynamics as reflected in the variety of shapes and slopes in the curves depicting death and infection records, that show the heterogeneity of the epidemic at each region with different population size and density, social behaviour and transmission pattern, especially during the second wave, showing unique dynamics that must be addressed individually. Knowledge of the real epidemic dynamics of different population nodes (regions, cities, neighbourhoods, etc.) is key to succeed in modelling attempts, because we could calculate the sole effect of group size 27 . This spatially-explicit information, in combination with population size per node and mobility would allow us to use a metapopulation approach in future models 28 www.nature.com/scientificreports/ Among the methodology limitations, the most important is that it can only be implemented retrospectively. Thus, these estimates cannot be used to control the pandemic in real time. But considering that different regions or countries are at different stages at the same time, results of this methodology for the first communities could be applied elsewhere. Furthermore, models are valuable tools to improve our knowledge of the pandemic dynamics, including the effectiveness of the different measures adopted to flatten the curve and design safe post-lockdown measures [31][32][33][34] . In addition, more realistic and accurate models could be obtained by use of the more realistic daily number of infections provided by REMEDID, which, in turn, would improve the model outputs and enhance comparisons of different lockdown and post-lockdown measures.
We believe that REMEDID methodology applied to daily COVID-19 deaths (if accurately reported) or to MoMo ED could be useful to analyse the dynamics of the pandemic retrospectively and more accurately quantify the real daily infections with respect to the official numbers. We only need the CFR and, for greater precision, the PDF of the period from Infections to Death or, alternatively, the PDF of Incubation Period and Illness Onset to Death. This approach could be implemented anywhere, improving our knowledge and understanding of the dynamics of the pandemic and the effectiveness of the confinement measures. This is of high importance to develop successful strategies to face and reduce the effects of future epidemic episodes.