Introduction

COVID-19 was first diagnosed in Wuhan, China in December 2019 and was declared a pandemic by the World Health Organization on March 11, 20201. The first case in India was reported on January 30, 2020, and as of April 4, 2021, there had been 12,587,921 cases and 165,132 deaths reported2. India responded quickly, instituting a nationwide lockdown on March 25, 2020, when there were only 657 cases and 11 deaths2,3. Epidemiologic models can be used to monitor disease rates and inform public health interventions, but data quality affects the ability of models to make accurate predictions. Underreporting of cases and deaths attributable to SARS-CoV-2 infection has hindered modeling efforts. This underreporting is primarily due to limited testing, deficiencies in the reporting infrastructure and a large number of asymptomatic infections.

Classical epidemiologic models, such as the Susceptible-Exposed-Infected-Removed (SEIR) compartmental model, have been used to predict the trajectory of the COVID-19 pandemic. For example, a modification of the standard SEIR model applied to Wuhan data and accounting for pre-symptomatic infectiousness, time-varying ascertainment rates, transmission rates and population movement found that the outbreak had high covertness and high transmissibility4. This work estimated that 87% (with a lower bound of 53%) of the infections in Wuhan before March 8 were unascertained4. However, traditional SEIR models do not account for imperfect testing5,6,7. Individuals with a false negative diagnostic test will also remain unascertained and contribute to the compartment of latent unreported cases in an SEIR model.

It is important to clarify that two classes of tests are discussed in the literature and are relevant to this paper: diagnostic tests and antibody tests. A diagnostic test (typically an RT-PCR test) is used to identify the presence of SARS-CoV-2, indicating an active infection8. An antibody test (i.e., a serology test) looks for the presence of antibodies, the body’s immune response to fight off SARS-CoV-2, indicating a past infection9. Figure 1 presents a timeline of when these tests are administered during the course of an infection. Due to the large number of asymptomatic cases and the limited number of tests, many infections do not get detected. Population-based seroprevalence surveys, therefore, give us an idea of the “true” number of infections, including reported and unreported cases, and consequently, of the ascertainment rate10. Thus, adjusted estimates of the total number of cases and ascertainment rates based on serological surveys, when available, provide a way to validate model-based estimates of unreported cases and ascertainment rates. These estimates would usually be impossible to validate (except in a simulation study) since these numbers are not observable in real data.

Figure 1. Timeline of COVID-19 diagnostic and antibody testing with respect to the infection and immune response time frame.

Both diagnostic and antibody tests suffer from false negatives and false positives. For the RT-PCR test, a false negative is more worrisome, since it gives an infected person a false assurance of safety. In contrast, a false positive from an antibody test is of greater concern, since it gives the false impression that the person has been infected in the past, has gained some protection from the virus, and is unlikely to be infected again. The RT-PCR test is reported to have a high false negative rate, ranging from 15 to 30% (i.e., low sensitivity, 70–85%), and a low false positive rate of around 1–4% (i.e., high specificity, 96–99%)11. Antibody assays are more accurate: a commercial assay (DiaSorin) has a sensitivity of around 97.6% and a specificity of 99.3% at about 15 days after infection10.

To address these data quality issues and the high rate of asymptomatic COVID-19 cases, we develop an extension to a standard SEIR model incorporating false negative rates in diagnostic testing to predict the numbers of unreported cases and deaths and to estimate the rate at which COVID-19 cases and deaths are being underreported (unascertained). Our method splits the traditional infected compartment into tested/untested and true positive/false negative compartments, thus accounting directly for misclassifications due to imperfections in the RT-PCR diagnostic tests. We apply this false negative-adjusted SEIR model to predict the transmission dynamics of SARS-CoV-2 in Delhi, the national capital region of India and one of the hotspots of COVID-19 in the country, using data from March 15 to June 30, 2020 for our original set of calculations and data from March 15 to December 31, 2020 for an updated set of calculations. We make predictions across a range of possible sensitivities for the diagnostic test, all assuming perfect specificity.

To understand the true extent of spread of the novel coronavirus, the National Centre for Disease Control (NCDC) in India has performed five rounds of serological surveys in Delhi, among several such studies conducted across the world (Table 1)12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44. While limited in reported details, the first round of the Delhi Serology Study collected 21,387 random samples across 11 districts in Delhi between June 27 and July 10, 2020 and found COVID-19 antibodies present in 22.86% of samples12,18. This seroprevalence is the highest among the studies up to July 2020 summarized in Table 1 but is similar to that found in New York City (22.70%), another large, densely populated area42. This indicates that Delhi, during July 2020, had high seroprevalence, even compared to worldwide epicenters and hotspots of COVID-19. The fifth and latest round of the Delhi serology study, which collected 28,000 random samples across all 272 municipal wards of Delhi between January 15 and 23, 2021, found an even higher seroprevalence, estimated at 56.13%20. This seroprevalence is the third highest among those summarized in Table 1, falling behind only the serosurvey in the UK and that in Paris, France (among the working population)31,39. These numbers show that Delhi continued to have high seroprevalence relative to COVID-19 hotspots worldwide as of early 2021.

Table 1 Summary of COVID-19 seroprevalence studies across the world.

A simple proportional estimate based on these reported seroprevalences would tell us that Delhi, with approximately 19.8 million people, had around 4.6 million cumulative cases by July 10, 2020, and around 11.1 million cumulative cases by January 23, 2021. These numbers contrast sharply with the 109,140 cumulative cases (3,300 total deaths) reported in Delhi as of July 10, 2020 and the 633,739 cumulative cases (10,799 total deaths) as of January 23, 2021, which represent, respectively, approximately 0.55% and 3.20% of Delhi’s population. This disparity suggests that only about 2.4% of cases were being detected (underreporting factor of about 42) as of July 10, 2020, and that by January 23, 2021, that percentage had improved to only 5.7% (underreporting factor of about 18). The seroprevalence estimate also implies that the infection fatality rate (IFR) for Delhi was of the order of 0.07%, or 717 per million, as of July 10, 2020, which updates to 0.10% as of January 23, 2021. This IFR seems low compared to estimates worldwide45, and as such it may be reasonable to argue that some COVID-related deaths may also be unreported, or the cause of death misclassified. Uncertainty regarding the reporting of death data is further supported by the very small fraction of deaths in India that are medically reported46 and by the fact that IFR estimates for SARS-CoV-2 from other studies in the world45 appear to be higher than that of influenza (as of 2018–19, the infection fatality rate of influenza was 961 per million, or around 0.1%)47.
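The proportional calculation described above can be sketched in a few lines of Python. This is only an illustrative back-of-the-envelope check, not the authors' code; the function name and the returned keys are hypothetical. Because it uses the unrounded product of the population and the seroprevalence, its outputs land close to, but not exactly on, the rounded figures quoted in the text (for example, about 4.5 million implied infections for July 10, 2020).

```python
# Illustrative sketch of the proportional seroprevalence arithmetic.
# Function and key names are hypothetical; inputs are the figures quoted in the text.

def proportional_estimates(population, seroprevalence, reported_cases, reported_deaths):
    """Return implied infections, detected fraction, underreporting factor and IFR."""
    implied_infections = population * seroprevalence
    return {
        "infections": implied_infections,
        "detected_fraction": reported_cases / implied_infections,
        "underreporting_factor": implied_infections / reported_cases,
        "ifr": reported_deaths / implied_infections,
    }

# Delhi, first serosurvey (July 10, 2020) and fifth serosurvey (January 23, 2021).
july = proportional_estimates(19.8e6, 0.2286, 109_140, 3_300)
january = proportional_estimates(19.8e6, 0.5613, 633_739, 10_799)

for label, est in (("July 10, 2020", july), ("January 23, 2021", january)):
    print(
        f"{label}: {est['infections'] / 1e6:.2f} million infections, "
        f"{100 * est['detected_fraction']:.1f}% detected, "
        f"underreporting factor {est['underreporting_factor']:.1f}, "
        f"IFR {100 * est['ifr']:.3f}%"
    )
```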

The availability of several rounds of seroprevalence estimates from the Delhi serology study provides a unique opportunity to validate model-predicted rates of latent unreported infections for our proposed false negative-adjusted SEIR model. The ELISA assay used in the Delhi serosurvey is a customized assay, some discussions on the development and imperfections of which are available both in recent literature and public media domain48,49. Based on these known imperfections, we perform adjustments of the reported case counts/infection estimates under different sensitivity and specificity assumptions for both the diagnostic and antibody (Ab) tests and compare the model-based estimates for the extent of underreporting to those obtained from the seroprevalence-based calculations. Other derived metrics such as case fatality rates and infection fatality rates are also presented. We use reported COVID-19 case and death count data from covid19india.org2. This framework can be adapted and applied to any set of reported case-counts where imperfect and limited testing exists.

Results

Extended SEIR model adjusted for misclassification

Figure 2 provides a schematic diagram of the proposed false-negative adjusted SEIR model. Under low (0.7), medium (0.85), meta-analyzed (0.952)50 and perfect (1) sensitivity, and perfect (1) specificity assumptions for the RT-PCR diagnostic test, we predict the total (reported and unreported) cases and deaths for Delhi using the proposed extended SEIR model.

Figure 2. Diagram describing model compartments and transmissions for the extended SEIR model. For the detailed descriptions of the compartments and parameters, please refer to Supplementary Table 2 and the “Methods” section.

Using data through June 30, 2020, this model estimates 4.8 million cases and 33,165 deaths on July 10, 2020 if we assume the RT-PCR test has a sensitivity of 0.85, and those predicted counts become 4.2 million and 28,499, respectively, if the sensitivity is assumed to be 1.0 (Fig. 3a). In contrast, the observed case and death counts are 109,140 and 3,300 reported in Delhi as of July 10, 20202. Examining the ratio of predicted total number of cases and the predicted number of reported cases on July 10, 2020, the estimated case underreporting is within the range of 34–53 and the same quantity for underreported deaths is between 8 and 13 (Supplementary Table 1). According to this model, 97–98% of Delhi’s cases remain undetected as of July 10, 2020. The model predictions under the different scenarios considered and the results relative to daily reported/total case and death counts are summarized in Supplementary Figs. 1a and 2a.
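As a quick check on how an underreporting factor translates into an undetected fraction (a worked step not spelled out in the text, assuming the undetected fraction equals one minus the reciprocal of the underreporting factor \(URF\)):

$$1-\frac{1}{URF}:\qquad 1-\frac{1}{34}\approx 0.971,\qquad 1-\frac{1}{53}\approx 0.981.$$

These two values bracket the 97–98% range quoted above.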

Figure 3. Summary of cumulative total (reported and unreported) cases and deaths for four different assumed values of sensitivity for the diagnostic RT-PCR test: 0.7, 0.85, 0.952, 1. In each subfigure, panels A and B respectively summarize the cases and deaths, along with their reported observed counterparts. The specificity of the diagnostic test is assumed to be 1. (a) Projections based on training data from March 15 to June 30, 2020, and a testing period from June 1 to July 26, 2020. (b) Projections based on training data from March 15 to December 31, 2020, and a testing period from January 1 to March 15, 2021.

Using data through December 31, 2020, this model estimates 10.2 million cases and 45,004 deaths on January 23, 2021 if we assume the RT-PCR test has a sensitivity of 0.85, and those predicted counts become 8.0 million and 34,949, respectively, if the sensitivity is assumed to be 1.0 (Fig. 3b). In contrast, the observed case and death counts are 633,739 and 10,799 reported in Delhi as of January 23, 20212. Examining the ratio of predicted total number of cases and the predicted number of reported cases on January 23, 2021, the estimated case underreporting is within the range of 13–22 and the same quantity for underreported deaths is between 3 and 7 (Supplementary Table 1). According to this model, 92–96% of Delhi’s cases remain undetected as of January 23, 2021. The model predictions under the different scenarios considered and the results relative to daily reported/total case and death counts are summarized in Supplementary Figs. 1b and 2b.

Future projections and variation of the underreporting factor through the course of the pandemic

We extended our projections of unreported case counts and the underreporting factors prospectively. Our projections for August 15, 2020 predict between 6.5 and 9.6 million cumulative (reported and unreported) cases in Delhi (across low to high false negative rate scenarios for the diagnostic test) (Supplementary Table 1). This provides us with a range of 35–54 for the underreporting factors for cases and a range of 8–13 for underreporting of deaths on August 15, 2020 (Supplementary Table 1). The temporal changes in the daily estimated case underreporting factors throughout the course of the pandemic are another crucial feature captured by our projections, as can be seen in Supplementary Fig. 3a. For the low (0.7) sensitivity scenario, the estimated case underreporting factor is 34 on June 1, 2020, the beginning of the first unlock period. This increases to 49 on June 20, 2020, when the daily number of tests and reported cases both increased.

The updated set of projections for February 15, 2021 predict between 8.1 and 13.8 million cumulative (reported and unreported) cases in Delhi (across low to high false negative rate scenarios for the diagnostic test) (Supplementary Table 1). This provides us with a range of 13–22 for the underreporting factors for cases and a range of 3–7 for underreporting of deaths on February 15, 2021 (Supplementary Table 1). Notably, our projections indicate that the underreporting factor for total cases is approximately constant over the period of January 1 to March 15, 2021, as can be seen in Supplementary Fig. 3b. For the low (0.7) sensitivity scenario, the estimated case underreporting stays at 22 throughout this period, and for the perfect (1.0) sensitivity scenario, this number decreases to 13.

Naïve corrections to reported test results using known misclassification rates for tests

Since the total (reported and unreported) number of cases and subsequently, the underreporting factor, are not part of the observed data and therefore our SEIR model estimates cannot directly be validated, we validate these estimates using the estimated number of true infections predicted by the serosurvey data. However, the antibody tests are also imperfect and as such we also correct the seroprevalence estimates for imperfect testing.

Using varying (low to perfect) sensitivities and specificities for the diagnostic and antibody tests, we estimate that the true case count in Delhi as of July 10, 2020, lies between 4.4 and 4.6 million, which represents 30 to 42 times the number of reported cases (Table 2a). These estimates agree well with the model-based findings reported in the previous subsection, indicating that 96–97% of cases in Delhi were unreported. Our updated estimate for the true case count in Delhi as of January 23, 2021 lies between 11.1 and 11.9 million, representing 17 to 21 times the number of reported cases (Table 2b). Again, these estimates agree with the model-based estimates from the previous subsection, indicating that 94–95% of cases in Delhi remained unreported even as recently as January 23, 2021.

Table 2 Summary of corrected number of cases, estimated underreporting factor, case-fatality rate based on reported cases and infection-fatality rate across different testing scenarios. Population size of Delhi is obtained from https://censusindia.gov.in/, and the testing, infection, recovery and fatality data are extracted from https://covid19india.org/.

Case fatality rate (CFR) and infection fatality rate (IFR)

The sensitivity and specificity of the diagnostic test impact our estimate of the case-fatality rate (\(\frac{\text{\# deaths}}{\text{\# reported cases}}\)), but not the infection-fatality rate (\(\frac{\text{\# deaths}}{\text{\# true infections}}\)). We estimate that the CFR lies between 2.24 and 3.06% as of July 10, 2020 (Table 2a), and between 1.40 and 1.91% as of January 23, 2021 (Table 2b). On the other hand, the sensitivity and specificity of the antibody test impact our estimates of the IFR. We estimate that the IFR lies between 0.07 and 0.08% based on the reported death counts as of July 10, 2020, and between 0.09 and 0.10% based on those as of January 23, 2021 (Table 2).

If we consider the hypothetical scenario of tenfold underreporting of deaths, as suggested by the SEIR model outputs (a range of 8–13), the infection-fatality rate estimate increases to 0.7–0.8% for July 10, 2020 (Table 2a). The updated SEIR model outputs indicate a range of 3–7 for the underreporting factor for deaths, and assuming fivefold underreporting of deaths, the adjusted infection-fatality rate estimate lies between 0.4 and 0.5% for January 23, 2021 (Table 2b). We are not able to perform any validation for the estimated underreporting factor for deaths as we do not have estimates of true death rates or excess deaths.

Discussion

We developed an extension of the standard SEIR compartmental model to adjust for imperfect diagnostic testing. Applying our model on publicly available case and death count data for Delhi, we estimated the underreporting factor for cases to be somewhere between 34 and 53 and that for deaths to be somewhere between 8 and 13 on July 10, 2020 (with updated estimated ranges of 13–22 and 3–7 respectively on January 23, 2021). We obtained adjusted estimates of the underreporting factor for cases using the seroprevalence study (30–42 on July 10, 2020 and 17–21 on January 23, 2021), which largely agreed with those estimated from the model. Further, the estimated underreporting factors were seen to be more stable over an extended period of time with the new set of training data and testing period compared to the original calculations. Having an accurate idea about the underreporting factor and the extent of spread is extremely helpful in terms of tracking the growth of the pandemic and determining intervention policies. Since repeated serological surveys to track the ever-evolving seroconversion scenario are rarely viable options due to high expense in terms of cost, resources, and time, model estimates updated regularly with new incoming data provide an opportunity to monitor the underreporting factor and unreported cases and deaths.

Limitations

(1) Our SEIR model incorporates only false negatives of the diagnostic tests but not false positives. We are more concerned about false negatives, as they give a false sense of safety to a patient and may increase the likelihood that the person will engage in activities that spread the disease. In addition, the false positive rates are quite low for PCR tests11. (2) We have refrained from incorporating a time-varying recovery rate in our model for several reasons. First, recovery data from India are not very accurate and there is often a “catch up” period. The definition of recovery (e.g., negative COVID test, no symptoms) is also variable. As such, this may induce more noise. Second, modeling recoveries better would change our estimate of “active” cases but would not affect the quantity we consider in this paper, cumulative cases reported up to a given date. Third, including more time-varying parameters in the model will complicate the model further, and depending on the availability and quality of the recovery data, it may yield unstable or questionable fits. Finally, without directly considering the recovery rate to be time-varying, it is possible to effectively capture changes in the recovery rate by modifying one of the other parameters affecting the recovery rate, such as the mortality rate, on which we have more data. For instance, one further generalized version of our model offers an option for a time-varying mortality rate, which has the potential to capture time-varying recovery51. (3) We used the seroprevalence estimate as a parallel, independent way of validating our model findings. An alternative approach for using serosurvey data is to introduce quarantine and immune compartments in the model structure and assume that symptomatic individuals are identified and successfully isolated with a given average delay from the onset of symptoms and that recovered individuals are never susceptible to an infection again52. We have not compared our method with this approach.
(4) The implications of any such model-based adjustments depend heavily upon the reliability of the reported seroprevalence information. To that end, it is important to mention that many pertinent details were not released publicly in the first and fifth (latest) phases of the Delhi NCDC serology survey, such as the response and positivity rates stratified by age, sex, job type and district, the sampling design and so on. A single reported number for the seroprevalence (22.86% and 56.13% respectively for the 1st and the 5th Delhi serosurveys) without sufficient detail on the survey design and assay used has limited use. (5) We do not know if individuals with antibodies are protected from re-infection, how long this protection lasts, the antibody levels needed to protect us from re-infections53, or whether a person with the antibody can still be contagious or show severe symptoms. The positive news from our estimates is that a large number of people in Delhi had the infection without feeling severe symptoms or needing clinical care.

Conclusion

There have been debates about the path towards achieving herd immunity in India. The estimated range for the herd immunity threshold lies within 44–73% (based on a worldwide estimated basic reproduction number of 1.8–3.8)54,55. For Delhi, and possibly even more so for other parts of India, herd immunity seems to be attainable as of recent dates but is certainly not a panacea we can rely on. Even based on the IFR obtained without adjusting for potential death underreporting and trusting the reported death counts as of January 23, 2021 (Table 2), if 50% of the 1.38 billion people in India get infected (a scenario that many proponents of herd immunity have suggested), this would imply an estimated 690,000 deaths. This estimate skyrockets to a staggering 3.0–3.5 million deaths if we believe the current estimated underreporting factor for deaths from our proposed model. Although we could not validate the estimated underreporting factor for deaths, the quality of the reported death data is questionable. For example, a mid-2020 study attempting to model COVID-19 fatalities stratified by age group indicates that at least 1500–2500 deaths in Delhi in the 60+ age group have not been reported56. The high estimate of fatalities when adjusted for underreporting, along with this evidence for underreporting of deaths in India, calls for cautious action, as India is beginning to see a second wave of the pandemic as of the beginning of April 202157. Strong policy decisions directed towards containment of the new surge in infections and logistically efficient vaccination strategies are the need of the hour in this regard.
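The figures in the paragraph above can be reproduced with simple arithmetic. The text does not state the formula behind the herd immunity threshold; assuming the commonly used relation \(1-1/R_0\), the quoted range of basic reproduction numbers gives

$$1-\frac{1}{1.8}\approx 0.444,\qquad 1-\frac{1}{3.8}\approx 0.737,$$

which is close to the 44–73% range cited. Likewise, taking the unadjusted IFR of about 0.10% as of January 23, 2021,

$$0.5\times 1.38\times {10}^{9}\times 0.001=690{,}000,$$

and scaling by the fivefold death underreporting assumed for that date gives \(5\times 690{,}000=3.45\) million, the upper end of the range quoted above.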

The appearance and spread of COVID-19 have taken the entire world by storm, but a large number of examples from across the world clearly show that we can change the narrative and course of this virus through extensive testing, contact tracing, use of masks, hand hygiene and social distancing. For example, Delhi has seen tremendous success in turning the corner of the virus curve, with the time-varying reproduction number staying below unity for the larger part of the period between September 2020 and February 2021 (Supplementary Figs. 4–5). This trend of improved containment, however, seems to have reversed in recent times, with the estimated time-varying reproduction number undergoing an alarming increase above unity during March–April 2021 (Supplementary Fig. 5). Several factors, including public complacency, waning of immunity acquired from past infections and the emergence of new variants, may have contributed to this surge58. The appearance of these escalated numbers also calls for closer inspection of the serosurvey-based estimates, since a \(>50\%\) seroprevalence and a spike in the number of new infections are theoretical antipodes in the context of a pandemic. Multiple potential reasons behind emerging biases in serosurvey estimates, including non-representative sampling and assay characteristics, have been discussed in recent literature, alongside possible ways of adjusting for such bias59,60.

Rapid and significant scientific advancements in both clinical and public health interventions have been made over the past year61. Data-driven policy decisions are crucial at this juncture. Our analytical framework for integrating diagnostic testing imperfections in the context of estimating unreported cases provides an alternative to conducting frequent serosurveys in Delhi. Validation of epidemiological model outputs against seroprevalence estimates inspires confidence in our inference and will hopefully prove to be a useful strategy for other case-studies.

Methods

Extended SEIR model adjusted for misclassification

We developed an extension of a standard SEIR model. In this model, the susceptible individuals (S) become exposed (E) when they are infected. After a latency period, exposed individuals are able to infect other susceptible individuals and are either untested (U) with probability \(1-r\) or tested (T) with probability \(r\). Tested individuals enter either the false negative compartment (F) with probability \(f\) or the (true) positive compartment (P) with probability \(1-f\). Individuals who are in the untested and the false negative compartments are considered unreported COVID-19 cases and enter either the recovered unreported (RU) or death unreported (DU) compartments. Similarly, those who tested positive move to either a recovered reported (RR) or death reported (DR) compartment. Figure 2 shows the SEIR model schematic, with arrows representing the possible transitions individuals in each compartment can undergo. The corresponding system of differential equations is presented below. The parameters and their initialization values are described in Supplementary Table 2.

  • \(\frac{\partial S}{\partial t}=-\beta \frac{S\left(t\right)}{N}\left({\alpha }_{P}P\left(t\right)+{\alpha }_{U}U\left(t\right)+ F\left(t\right)\right)+\lambda -\mu S\left(t\right).\)

  • \(\frac{\partial E}{\partial t}=\beta \frac{S\left(t\right)}{N}\left({\alpha }_{P}P\left(t\right)+{\alpha }_{U}U\left(t\right)+F\left(t\right)\right)-\frac{E\left(t\right)}{{D}_{e}}-\mu E\left(t\right).\)

  • \(\frac{\partial U}{\partial t}=\frac{(1-r)E(t)}{{D}_{e}}-\frac{U\left(t\right)}{{\beta }_{1}{D}_{r}}-{\delta }_{1}{\mu }_{c} U\left(t\right)-\mu U\left(t\right).\)

  • \(\frac{\partial P}{\partial t}=\frac{r(1-f)E(t)}{{D}_{e}}-\frac{P\left(t\right)}{{D}_{r}}-{\mu }_{c}P\left(t\right)-\mu P\left(t\right).\)

  • \(\frac{\partial F}{\partial t}=\frac{rfE(t)}{{D}_{e}}-\frac{{\beta }_{2}F\left(t\right)}{{D}_{r}}-\frac{{\mu }_{c} F\left(t\right)}{{\delta }_{2}}-\mu F\left(t\right).\)

  • \(\frac{\partial RU}{\partial t}=\frac{U(t)}{{\beta }_{1}{D}_{r}}+\frac{{\beta }_{2}F(t)}{{D}_{r}}-\mu RU\left(t\right).\)

  • \(\frac{\partial RR}{\partial t}=\frac{P\left(t\right)}{{D}_{r}}-\mu RR\left(t\right).\)

  • \(\frac{\partial DU}{\partial t}={\delta }_{1}{\mu }_{c}U\left(t\right)+\frac{{\mu }_{c}F\left(t\right)}{{\delta }_{2}}.\)

  • \(\frac{\partial DR}{\partial t}={\mu }_{c}P\left(t\right).\)

Here, \(X(t)\) denotes the number of individuals in the compartment of interest \(X\) at time \(t\). Based on this set of differential equations, we calculate the basic reproduction number of the proposed model using the Next Generation Matrix Method62. The expression for \({R}_{0}\) turns out to be the following:

$${R}_{0}=\frac{\beta {S}_{0}}{\mu {D}_{e}+1}\left(\frac{{\alpha }_{U}\left(1-r\right)}{\frac{1}{{\beta }_{1}{D}_{r}}+{\delta }_{1}{\mu }_{c}+\mu }+\frac{{\alpha }_{P}r\left(1-f\right)}{\frac{1}{{D}_{r}}+{\mu }_{c}+\mu }+\frac{rf}{\frac{{\beta }_{2}}{{D}_{r}}+\frac{{\mu }_{c}}{{\delta }_{2}}+\mu }\right).$$

Here, \({S}_{0}=\frac{\lambda }{\mu }=1\), since we have assumed the natural birth and death rates to be equal within this short period of time. In this setting, both \(\beta\) and \(r\) are time-varying parameters which are estimated using the Metropolis–Hastings MCMC method63. To estimate the parameters, we first need to be able to solve the differential equations, which is difficult in this continuous-time setting. It is also worth noting that we do not require the values of the variables at every time point. Instead, we only need their values at discrete time steps, i.e., for each day. Thus, we approximate the above set of differential equations by a set of recurrence relations. For any compartment \(X\), the instantaneous rate of change with respect to time \(t\) (given by \(\frac{\partial X}{\partial t}\)) is approximated by the difference between the counts of that compartment on the \({\left(t+1\right)}^{th}\) day and the \({t}^{th}\) day, that is, \(X\left(t+1\right)-X(t)\). Starting with an initial value for each of the compartments on Day 1 and using the discrete-time recurrence relations, we can then obtain the solutions of interest. Some examples of these discrete-time recurrence relations are presented below.

  • \(E\left(t+1\right)-E\left(t\right)=\beta \frac{S\left(t\right)}{N}\left({\alpha }_{P}P\left(t\right)+{\alpha }_{U}U\left(t\right)+F\left(t\right)\right)-\frac{E\left(t\right)}{{D}_{e}}-\mu E\left(t\right),\)

  • \(U\left(t+1\right)-U\left(t\right)=\frac{\left(1-r\right)E\left(t\right)}{{D}_{e}}-\frac{U\left(t\right)}{{\beta }_{1}{D}_{r}}-{\delta }_{1}{\mu }_{c} U\left(t\right)-\mu U\left(t\right),\)

  • \(P\left(t+1\right)-P\left(t\right)=\frac{r\left(1-f\right)E\left(t\right)}{{D}_{e}}-\frac{P\left(t\right)}{{D}_{r}}-{\mu }_{c}P\left(t\right)-\mu P\left(t\right),\)

  • \(F\left(t+1\right)-F\left(t\right)=\frac{rfE\left(t\right)}{{D}_{e}}-\frac{{\beta }_{2}F\left(t\right)}{{D}_{r}}-\frac{{\mu }_{c} F\left(t\right)}{{\delta }_{2}}-\mu F\left(t\right).\)

The rest of the differential equations can each be similarly approximated by a discrete-time recurrence relation. The parameters are estimated using training data from Delhi from March 15 to June 30, 2020 for our first set of analyses, and from March 15 to December 31, 2020 for our updated set of analyses. The training data were divided into multiple periods, in accordance with the lockdown and unlock procedures employed by the government of India, as described in Supplementary Table 3. Using these, we obtained predictions for dates ranging from June 1 through August 15, 2020, for the first set of analyses, and from January 1 to March 15, 2021 for the updated set of analyses. Since we used an MCMC algorithm to estimate the parameters and the posterior means of the compartment sizes, it is easy to obtain empirical posterior credible intervals based on the full set of MCMC draws to quantify the uncertainty associated with these estimates and projections. However, we deliberately refrained from reporting the uncertainty estimates in this paper to avoid intricacies in the presentation of the results that may obscure the central message. Further, we assumed the RT-PCR test specificity to be 1 and did not incorporate false positives arising from the diagnostic test to avoid additional assumptions for model identifiability.
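The recurrence relations and the expression for \({R}_{0}\) given above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all numeric parameter values are hypothetical placeholders (they are not the values in Supplementary Table 2), \(\beta\) and \(r\) are held constant rather than time-varying, natural births and deaths are switched off (\(\lambda =\mu =0\)), and no MCMC estimation is performed. The names `Params`, `step`, `simulate` and `basic_reproduction_number` are introduced here purely for illustration.

```python
from dataclasses import dataclass

COMPARTMENTS = ("S", "E", "U", "P", "F", "RU", "RR", "DU", "DR")

@dataclass
class Params:
    # All defaults are hypothetical placeholders, not fitted or published values.
    beta: float = 0.5      # transmission rate (time-varying in the paper)
    r: float = 0.2         # probability of being tested (time-varying in the paper)
    f: float = 0.15        # false negative probability, i.e. 1 - sensitivity
    alpha_P: float = 0.5   # relative infectiousness of true positives
    alpha_U: float = 0.7   # relative infectiousness of untested cases
    D_e: float = 5.0       # latency period (days)
    D_r: float = 10.0      # infectious period (days)
    beta_1: float = 0.6    # scaling of recovery time for untested cases
    beta_2: float = 0.7    # scaling of recovery rate for false negatives
    delta_1: float = 0.3   # scaling of death rate for untested cases
    delta_2: float = 0.7   # inverse scaling of death rate for false negatives
    mu_c: float = 0.002    # COVID-19 death rate
    mu: float = 0.0        # natural death rate (assumed zero here)
    lam: float = 0.0       # natural birth rate (assumed zero here)
    N: float = 1.98e7      # population size

def step(state, p):
    """Advance every compartment by one day using the recurrence relations."""
    S, E, U, P, F, RU, RR, DU, DR = (state[k] for k in COMPARTMENTS)
    force = p.beta * S / p.N * (p.alpha_P * P + p.alpha_U * U + F)
    return {
        "S": S - force + p.lam - p.mu * S,
        "E": E + force - E / p.D_e - p.mu * E,
        "U": U + (1 - p.r) * E / p.D_e - U / (p.beta_1 * p.D_r)
             - p.delta_1 * p.mu_c * U - p.mu * U,
        "P": P + p.r * (1 - p.f) * E / p.D_e - P / p.D_r - p.mu_c * P - p.mu * P,
        "F": F + p.r * p.f * E / p.D_e - p.beta_2 * F / p.D_r
             - p.mu_c * F / p.delta_2 - p.mu * F,
        "RU": RU + U / (p.beta_1 * p.D_r) + p.beta_2 * F / p.D_r - p.mu * RU,
        "RR": RR + P / p.D_r - p.mu * RR,
        "DU": DU + p.delta_1 * p.mu_c * U + p.mu_c * F / p.delta_2,
        "DR": DR + p.mu_c * P,
    }

def simulate(initial, p, days):
    """Return the list of daily states, starting from the initial state."""
    trajectory = [dict(initial)]
    for _ in range(days):
        trajectory.append(step(trajectory[-1], p))
    return trajectory

def basic_reproduction_number(p, S0=1.0):
    """Evaluate the closed-form R0 expression term by term."""
    prefactor = p.beta * S0 / (p.mu * p.D_e + 1)
    term_U = p.alpha_U * (1 - p.r) / (1 / (p.beta_1 * p.D_r) + p.delta_1 * p.mu_c + p.mu)
    term_P = p.alpha_P * p.r * (1 - p.f) / (1 / p.D_r + p.mu_c + p.mu)
    term_F = p.r * p.f / (p.beta_2 / p.D_r + p.mu_c / p.delta_2 + p.mu)
    return prefactor * (term_U + term_P + term_F)

params = Params()
initial = {name: 0.0 for name in COMPARTMENTS}
initial["E"] = 100.0                      # hypothetical seed of exposed individuals
initial["S"] = params.N - initial["E"]
trajectory = simulate(initial, params, days=120)
final = trajectory[-1]
reported = final["P"] + final["RR"] + final["DR"]
unreported = final["U"] + final["F"] + final["RU"] + final["DU"]
print("R0 under placeholder parameters:", round(basic_reproduction_number(params), 2))
print("total / reported cases after 120 days:", round((reported + unreported) / reported, 1))
```

With births and natural deaths switched off, the recurrences conserve the total population, which gives a simple sanity check on any implementation of the nine relations.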

Naïve corrections to reported test results using known misclassification rates

Notations: Let N = population size, X = number of true cases in the population (hence N – X = number of non-cases in the population), T = number of people tested, S = number of true cases tested (hence T – S = number of non-cases tested, X – S = number of true cases not tested, N – X – T + S = number of non-cases not tested), P = number of positive tests (also, therefore, cumulative number of reported cases, hence T – P = number of negative tests). Note that X and S are the only two unknowns in this setting. Also, let us assume that the sensitivity of the test of interest is \(\alpha\) and its specificity is \(\beta\). With that, we can set up the following equation, because there are two ways a test can be positive, as can be seen in Supplementary Fig. 6.

$$P=S\times \alpha +\left(T-S\right)\times \left(1-\beta \right)\Rightarrow \frac{P}{T}=\frac{S}{T}\times \alpha +\left(1-\frac{S}{T}\right)\times \left(1-\beta \right).$$

Rearranging the terms, we get the following expression for \(S\).

$$S=T\times \frac{\frac{P}{T}+\beta -1}{\alpha +\beta -1}.$$
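For completeness, the intermediate algebra behind this expression can be written out as follows. It uses only the symbols defined above and assumes \(\alpha +\beta \ne 1\), so that the division is valid.

$$P=S\alpha +\left(T-S\right)\left(1-\beta \right)=S\left(\alpha +\beta -1\right)+T\left(1-\beta \right)\Rightarrow S=\frac{P-T\left(1-\beta \right)}{\alpha +\beta -1}=T\times \frac{\frac{P}{T}+\beta -1}{\alpha +\beta -1}.$$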

Assuming that the proportion of cases among those tested is the same as in the overall population (random, and hence homogeneous, testing), we can replace \(S\) by \(\frac{TX}{N}\), which leads to the following updated equation.

$$\frac{P}{T}=\frac{X}{N}\times \alpha +\left(1-\frac{X}{N}\right)\times \left(1-\beta \right).$$

Solving this, we get the following expression for \(X\).

$$X=N\times \frac{\frac{P}{T}+\beta -1}{\alpha +\beta -1}.$$

Thus, for a given pair \(\alpha\) and \(\beta\), these two expressions give the corrected number of reported cases (\(S\)) and the estimated number of true (reported and unreported) cases (\(X\)). For the computation of \(S\), we use \(\frac{P}{T}=\frac{\mathrm{109,140}}{\mathrm{747,109}}\approx 0.146\), the test positive rate of the RT-PCR tests in Delhi2 as of July 10, 2020. For the computation of \(X\), we use \(\frac{P}{T}=\frac{\mathrm{4,889}}{\mathrm{21,387}}\approx 0.229\), the positive rate reported by the first round of the Delhi serological survey12,13,14. For the updated analysis based on more recent data, these rates become \(\frac{\mathrm{644,064}}{\mathrm{10,289,461}}\approx 0.063\) and \(\frac{\mathrm{15,716}}{\mathrm{28,000}}\approx 0.561\), respectively.

Given these estimates, we compute the adjusted underreporting factor as \(URF=\frac{X}{S}\). Letting \(D\) denote the cumulative number of deaths up to a date of interest, we compute the corrected case fatality rate and infection fatality rate as \(CFR=\frac{D}{S}\) and \(IFR=\frac{D}{X}\), respectively. Further, to adjust for a scenario in which only 1 of every \(M\) deaths due to COVID-19 is reported (\(M\)-fold underreporting of deaths), we update the IFR estimate to \(IFR=\frac{MD}{X}\). We calculate adjusted IFR estimates with \(M=10\) for the July 10, 2020 computations and with \(M=5\) for the January 23, 2021 computations. Based on the data from Delhi2, we use \(D=\mathrm{3,300}\) for July 10, 2020 and \(D=\mathrm{10,994}\) for January 23, 2021. We use a population size of \(N=1.98\times {10}^{7}\) based on recent population data64, because the last official census in India was conducted in 2011 and its count may not represent the current population.
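The quantities defined above can be computed in a few lines. The Python sketch below is illustrative only: the function names `corrected_count` and `summarize` are hypothetical, the inputs are the July 10, 2020 figures quoted in the text, and the printed case uses the perfect-test setting \(\alpha =\beta =1\) as a sanity check, for which \(S\) reduces to \(P\).

```python
def corrected_count(total, positive_rate, sensitivity, specificity):
    """Return total * (p + beta - 1) / (alpha + beta - 1), the naive correction."""
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0.0:
        raise ValueError("the correction requires sensitivity + specificity > 1")
    return total * (positive_rate + specificity - 1.0) / denominator


# Figures quoted in the text for Delhi, July 10, 2020.
N = 1.98e7                         # population size
T_PCR, P_PCR = 747_109, 109_140    # RT-PCR tests and positives
T_SERO, P_SERO = 21_387, 4_889     # serosurvey samples and positives
D = 3_300                          # cumulative reported deaths
M = 10                             # assumed death underreporting factor


def summarize(alpha_pcr, beta_pcr, alpha_sero, beta_sero):
    """Corrected counts and derived rates for one choice of test accuracies."""
    S = corrected_count(T_PCR, P_PCR / T_PCR, alpha_pcr, beta_pcr)
    X = corrected_count(N, P_SERO / T_SERO, alpha_sero, beta_sero)
    return {
        "S": S,                    # corrected number of reported cases
        "X": X,                    # estimated number of true cases
        "URF": X / S,              # underreporting factor
        "CFR": D / S,              # case fatality rate
        "IFR": D / X,              # infection fatality rate
        "IFR_adj": M * D / X,      # IFR with M-fold death underreporting
    }


# Sanity check with perfect tests: S equals P and X equals N * P / T.
perfect = summarize(1.0, 1.0, 1.0, 1.0)
for name, value in perfect.items():
    print(f"{name}: {value:.6g}")
```

Note that the correction yields a negative count whenever the observed positive rate falls below \(1-\beta\); none of the rates quoted here does so for the specificities considered.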

A critical question here is the choice of \(\alpha\) and \(\beta\) for the two tests, so that the adjustments reflect sensible and realistic scenarios. Based on previously reported sensitivity and specificity levels for the diagnostic test10,49, we used the combinations \(\alpha =\beta =1\) (perfect test), \(\alpha =0.952\) and \(\beta =0.99\), \(\alpha =0.85\) and \(\beta =0.99\), and \(\alpha =0.7\) and \(\beta =0.99\). The serological assay used by NCDC is a customized assay. We referred to the existing literature and publicly available discussions on this particular assay, alongside literature on serological assays in general48,49, and decided to use the combinations \(\alpha =\beta =1\) (perfect test), \(\alpha =0.976\) and \(\beta =0.993\), and \(\alpha =0.92\) and \(\beta =0.97\).
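To show how the choice of \(\alpha\) and \(\beta\) propagates into the corrected counts, the following Python sketch evaluates \(S\), \(X\) and \(X/S\) over the combinations listed above, using the July 10, 2020 positive rates quoted earlier. Pairing every diagnostic setting with every serological setting is an assumption made here for illustration, and the helper name `corrected_fraction` is hypothetical.

```python
from itertools import product


def corrected_fraction(positive_rate, sensitivity, specificity):
    """Return (p + beta - 1) / (alpha + beta - 1) for an observed positive rate p."""
    return (positive_rate + specificity - 1.0) / (sensitivity + specificity - 1.0)


# (sensitivity, specificity) combinations listed in the text.
PCR_SETTINGS = [(1.0, 1.0), (0.952, 0.99), (0.85, 0.99), (0.7, 0.99)]
SERO_SETTINGS = [(1.0, 1.0), (0.976, 0.993), (0.92, 0.97)]

# Figures quoted in the text for Delhi, July 10, 2020.
N = 1.98e7
T_PCR = 747_109
PCR_RATE = 109_140 / 747_109
SERO_RATE = 4_889 / 21_387

rows = []
for (a_pcr, b_pcr), (a_sero, b_sero) in product(PCR_SETTINGS, SERO_SETTINGS):
    S = T_PCR * corrected_fraction(PCR_RATE, a_pcr, b_pcr)
    X = N * corrected_fraction(SERO_RATE, a_sero, b_sero)
    rows.append({"pcr": (a_pcr, b_pcr), "sero": (a_sero, b_sero),
                 "S": S, "X": X, "URF": X / S})

for row in rows:
    print(f"PCR {row['pcr']}  serology {row['sero']}  "
          f"S={row['S']:.0f}  X={row['X']:.0f}  URF={row['URF']:.1f}")
```

With specificity held at 0.99, lowering the diagnostic sensitivity increases the corrected case count \(S\), since more true cases among those tested are assumed to have been missed.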