Introduction

The novel coronavirus disease 2019 (abbreviated “COVID-19”), caused by the SARS-CoV-2 virus, was declared a pandemic by the World Health Organization (WHO) on March 11, 2020. COVID-19 differs fundamentally from earlier epidemics, including SARS and Ebola. It has caused unprecedented challenges and crises for health, society, the economy, and many other areas, with about 6M deaths and 460M confirmed cases reported worldwide (WHO COVID-19: https://covid19.who.int/).

COVID-19 disease characteristics

Despite sharing common epidemic clinical symptoms, such as fever and cough, COVID-19 presents other characteristics that make it mysterious, contagious and challenging to quantify, model and contain. (1) High contagiousness and rapid spread: the review1,2 finds that the \(R_0\) of COVID-19 may be larger than 3.0 in the initial stage, higher than that of SARS (1.7–1.9) and MERS (\(<1\))3. SARS-CoV-2 is more transmissible than severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV), although it shares 79% genomic sequence identity with SARS-CoV and 50% with MERS-CoV4,5,6,7. (2) A wide range of incubation periods: a median incubation period of approximately 5 days was reported in8 for COVID-19, which is similar to SARS. In9, the mean incubation period ranges from 4 to 6 days, comparable to SARS (4.4 days) and MERS (5.5 days). Although an average length of 5–6 days is reported in the literature, the actual incubation period may be as long as 14 days8,10,11,12. (3) A large number of asymptomatic and undocumented infections: asymptomatic infections may not be screened and diagnosed before symptom onset, leading to a large number of undocumented infections and a potential risk of contact with infected individuals13. For example, the review in14 reports that 6–41% of populations are truly asymptomatic, while the study in15 shows that a large percentage (86%) of infections are undocumented and that about 80% of documented cases originate from undocumented ones. (4) Frequent mutation with mysterious strains and high contagion: the major SARS-CoV-2 variants of concern, such as B.1.1.7 (labeled Alpha by WHO), B.1.351 (Beta) and B.1.617.2 (Delta), emerged with a higher transmissibility (B.1.1.7 at about 50% increased transmission)16 and reproduction rate (increasing 1–1.4 by B.1.1.7)17, challenging existing vaccines, containment and mitigation methods.

COVID-19 modeling challenges

The aforementioned COVID-19 complexities were even more pronounced in the first 2 months of the COVID-19 pandemic. This early stage presented various uncertainties in terms of insufficient and inconsistent case reporting and testing, making the reported data noisy and uncertain. Modeling such COVID-19 uncertainties significantly challenges existing epidemic modeling and complex system modeling1,18. First, the COVID-19 transmission processes involve uncertainty, e.g., the randomness of infection and contagion, particularly during the incubation period and for asymptomatic infectious cases, making them difficult to model properly. Second, many observable and hidden factors (e.g., related to asymptomatic contagion and habitual behaviors) and mitigation-related factors (e.g., lockdown, social distancing, and human cooperation) interact with each other and jointly affect the COVID-19 transmission processes and dynamics. Third, the infection and contagion processes and the transitions between different states, such as the susceptible, the infectious, and the recovered, appear to be highly complex: random, nonlinear, time-varying, and noisy. Lastly, the documented COVID-19 data with the confirmed, death, and recovered case numbers (e.g., in the JHU CSSE19) are macroscopic and subject to significant data uncertainty, i.e., quality issues, including acquisition inconsistencies, noise, errors, under-reporting and missing reports, and randomness in case confirmation and reporting in different countries and regions. The publicly available case data does not disclose the full picture and the hidden nature of COVID-19 dynamics and may not reflect reality. For example, inaccurate statistics and missing reports likely affect a considerable number of asymptomatic infections. The actual sizes of the susceptible, infectious and recovered compartments may be difficult to obtain, so the data is highly unreliable and the ground truth available for evaluation is poor.

In addition, social reinforcement is another phenomenon embedded in a COVID-19-affected community. In social systems, a stimulus from one person may increase the frequency of the behaviors that immediately precede it. Such an interpersonal stimulus is called social reinforcement, which characterizes the reinforced influence of social behaviors20. The COVID-19 pandemic also involves large-scale social behaviors and interactions, so social reinforcement is important for understanding COVID-19 transmission. Examples of social reinforcement in COVID-19 include infections through dense and close social contacts, household-to-household infections, household and local community infections, and the phenomenon that increasing infection awareness may slow the spread of infectious diseases.

As a result, modeling COVID-19 is highly challenging. Special attention must be paid to the various uncertainties above, in addition to the epidemic attributes. However, existing data-driven COVID-19 modeling relies on poor-quality and uncertain case data and can therefore easily produce overfitted, underfitted, or non-actionable results1,18.

Modeling gap analysis

In light of the huge number of publications on COVID-19 modeling1,2, we roughly categorize COVID-19 modeling into three directions: epidemic compartmental modeling of the COVID-19 infection and transmission processes, which is built on epidemiological compartments and models for existing epidemics; data-driven modeling of COVID-19 intrinsic characteristics and infection processes on the relevant COVID-19 data; and hybrid modeling, which integrates knowledge and modeling methods for a compound or comprehensive epidemic understanding of COVID-19. A typical epidemic compartmental model following conventional epidemics is the susceptible-infected-recovered (SIR) model. SIR simplifies the transmission process and separates the population into three compartments: the susceptible, the infectious, and the removed. A large number of SIR variants are available with more specific compartments. For example, SEIR21 adds an extra exposed compartment, and TSIR22 incorporates time-dependent transmission into SIR to model the varying transmission and removal rates over time. These classic SIR-based compartmental models were designed for past epidemics and their transmission processes, and they do not directly capture the aforementioned COVID-19 complexities.
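To make the three-compartment structure concrete, the following is a minimal sketch of the classic SIR dynamics on population fractions, using a daily forward-Euler update. It is an illustration only: the function name `simulate_sir` and all parameter values are hypothetical and are not taken from any of the cited models.

```python
import numpy as np

def simulate_sir(beta, gamma, s0, i0, r0, days):
    """Forward-Euler integration of the classic SIR model on population fractions."""
    s, i, r = [s0], [i0], [r0]
    for _ in range(days):
        new_inf = beta * s[-1] * i[-1]  # S -> I flow in one day
        new_rem = gamma * i[-1]         # I -> R flow in one day
        s.append(s[-1] - new_inf)
        i.append(i[-1] + new_inf - new_rem)
        r.append(r[-1] + new_rem)
    return np.array(s), np.array(i), np.array(r)

# Hypothetical constant rates, chosen only for illustration.
s, i, r = simulate_sir(beta=0.3, gamma=0.1, s0=0.999, i0=0.001, r0=0.0, days=120)
print("peak infectious fraction on day", int(i.argmax()))
```

Because both rates are constants here, the curve shape is fixed by two numbers, which is the rigidity that the time-dependent and density-dependent variants discussed in this section try to relax.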

Several very recent SIR-based extensions are available for modeling COVID-19. For example, Chen et al.23 explore the time-dependent SIR for the time-varying transmission of COVID-19. Such models simply assume the SIR variables are temporal, while the actual COVID-19 processes may evolve over multiple factors, e.g., enforced interventions and diversified cooperation levels. Further, fine-grained SIR models like SIDARTHE24 and SEI_DI_UQHRD25 divide the infection process into more specific stages to mimic the features of COVID-19. However, they overfit the specific country/regional data and lack general applicability. SIR-based probabilistic models like SIR-Poisson26 assume the infected case numbers follow specific distributions such as Poisson distributions, while the actual conditions of COVID-19 case developments may be much more complicated. Finally, limited research is available on modeling the interactions between COVID-19 infections and social reinforcement27, in particular in the early stage of the COVID-19 pandemic.

A critical reason for the aforementioned problems of COVID-19 models is that they mainly focus on fitting the COVID-19 data (e.g., by regression) or reproducing the transmission processes (e.g., with specific hypotheses) rather than directly addressing the aforementioned COVID-19 complexities. This is also evidenced by the overwhelming number of regression-based COVID-19 analyses published by the global research communities2.

SUDR for modeling COVID-19 uncertainties

In this work, we are motivated to directly characterize the aforementioned COVID-specific uncertainties in the context of social interactions, asymptomatic infections, and data quality issues for the early stage of the COVID-19 pandemic. We address the modeling challenges and gaps by integrating both domain- (the epidemic and social attributes of COVID-19) and data- (quantifying COVID-19 attributes and factors) driven modeling. We aim to leverage multi-resources about COVID-19 and multi-aspect modeling capabilities to address the aforementioned various COVID-19 uncertainties and challenges1. Combining domain- and data-driven modeling thinking18, we characterize the COVID-19 epidemic processes by capturing asymptomatic and undocumented infections and social reinforcement which are essential but hidden in the COVID-19 systems and processes. This is achieved using a hybrid approach: (1) capturing and incorporating new knowledge and compartments about the COVID-19 epidemiology into enhanced epidemic SIR models; (2) incorporating data-driven probabilistic mechanisms into the epidemic SIR-based extension to model the uncertainties of COVID-19; and (3) creating factors and mechanisms to capture the social characteristics of COVID-19.

Accordingly, a density-dependent Bayesian probabilistic Susceptible-Undocumented infectious-Documented infectious-Recovered (SUDR) model is proposed. First, to capture the confirmed and undocumented asymptomatic infections, SUDR replaces the infection compartment in the basic SIR model with two compartments: undocumented infection (U), and documented infection (D). SUDR assumes that, when infected by the virus, the susceptibles first transfer to the undocumented infectious compartment and then move into the documented infected compartment only if detected. Second, we take a density-dependent view of COVID-19 infection development and characterize undocumented infections and social reinforcement in the COVID-19 contagion. Third, we incorporate probabilistic mechanisms to model the density likelihood-based prevalence, unknown infections, and the uncertain and noisy conditions of COVID-19 data. Lastly, Bayesian inference is applied to approximate the SUDR solution. To capture the imperfect and noisy statistics of COVID-19 data, we elaborate the model as a probabilistic extension with certain priors and solve it by sampling from the mean-field posterior distribution.
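The compartment flow described above (susceptibles become undocumented infections, which become documented only once detected) can be sketched deterministically as follows. This is a simplified illustration under stated assumptions, not the SUDR model itself: the linear form of `beta_linear`, the choice that removal happens only from the documented compartment, the daily Euler step, and every numeric value are assumptions, and the probabilistic priors and Bayesian inference are omitted entirely.

```python
import numpy as np

def beta_linear(i_u, b0=0.4, b1=0.5):
    """Hypothetical density-dependent infection rate: grows with the U density."""
    return b0 + b1 * i_u

def simulate_sudr(beta_fn, theta, gamma, s0, u0, d0, r0, days):
    """Daily Euler steps for an S -> U -> D -> R flow on population fractions."""
    s, u, d, r = s0, u0, d0, r0
    traj = [(s, u, d, r)]
    for _ in range(days):
        new_inf = beta_fn(u) * s * u  # only undocumented cases infect susceptibles
        new_det = theta * u           # undocumented cases that get detected
        new_rem = gamma * d           # assumption: removal from D only
        s -= new_inf
        u += new_inf - new_det
        d += new_det - new_rem
        r += new_rem
        traj.append((s, u, d, r))
    return np.array(traj)

# Hypothetical detection and removal rates, for illustration only.
traj = simulate_sudr(beta_linear, theta=0.15, gamma=0.1,
                     s0=0.9999, u0=0.0001, d0=0.0, r0=0.0, days=200)
peak_u_day, peak_d_day = int(traj[:, 1].argmax()), int(traj[:, 2].argmax())
print("U peaks on day", peak_u_day, "and D peaks on day", peak_d_day)
```

In this sketch the documented curve can only peak after the undocumented one, because documented cases are fed exclusively by detections from the undocumented compartment.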

Figure 1 illustrates the SUDR rationale of modeling the undocumented and asymptomatic infections and the social interactions between infected individuals (in red) and susceptible individuals (in green) in COVID-19. We assume all infections are undocumented at the beginning. Some then transition to documented infections once they are confirmed by COVID-19 testing. Since the majority of infected symptomatic individuals are identified as documented infections and then quarantined, they have a low probability of further infecting other susceptible individuals. Hence, we assume only undocumented infectious individuals can infect the susceptibles: interactions among uninfected susceptibles are safe, while interactions with asymptomatic infections are unsafe. More interactions and denser contacts with asymptomatic infections increase the chance of being infected. Accordingly, the central green nodes in scenarios (a) and (c) share the same probability of being infected since they have the same density of unsafe interactions and close contacts with the infected. However, more unsafe interactions, as shown in (b), increase the infection probability of the susceptible individuals, showing social reinforcement and cluster infection in COVID-19 transmission28. As a result, the infection rate of the central green node in scenario (b) is much higher (e.g., by three times if it is linear additive) than that of scenario (a). Thus, SUDR models the transmission rate as a function of the undocumented infection density.

Figure 1

The SUDR rationale: modeling the social interaction density of asymptomatic COVID-19 infections. The different colored nodes represent individuals in different epidemiological compartments. The connections between nodes represent safe or unsafe social interactions with uninfected or asymptomatic infections. The susceptible, undocumented infected, and documented infected are represented by the green nodes, the red nodes with a dashed outline, and the red nodes with the solid outline, respectively. The undocumented infected with the dashed outline shows that they are contagious through unsafe interactions (the yellow lines), while the documented infected with the solid outline shows that they cannot infect the susceptible since they are quarantined or isolated. Thus, the social interactions between the susceptible and the documented infected (the black lines), if they exist, are safe and highly perceptive.

In summary, this work discloses the following insights and contributions in modeling COVID-19 uncertainties:

  • A susceptible-undocumented infectious-documented infectious-recovered model SUDR explicitly captures the undocumented infections corresponding to asymptomatic infections, often missed in existing COVID-19 modeling.

  • A probabilistic density-dependent infection function models both the COVID-19 uncertainty w.r.t. the infection rate over the density of undocumented infections and the exogenous contagion reinforcement through social interactions. It addresses the limitations of assuming a constant or time-dependent infection rate.

  • Bayesian inference with a mean-field method solves the SUDR optimization to cope with the poor quality of COVID-19 data, including uncertainty, noise, and sparsity.

We empirically verify the effectiveness of our method in detecting undocumented infections with COVID-19 data from different countries with noise and sparsity. The experiment results show that our model outperforms the classic SIR model, time-dependent SIR model, and probabilistic SIR model on the COVID-19 data.

Results

Here, we report the results of SUDR in inferring undocumented infections and epidemic attributes. We further analyze the robustness of the model with different levels of sparsity.

Inferring undocumented infections

As discussed above, there is often a large number of undocumented (unreported) infected cases, in particular asymptomatic or mildly symptomatic infections, throughout the COVID-19 transmission process. This is more evident at the early stage of the epidemic outbreak due to the limited number of tests and the lack of preparedness, and in vaccinated communities owing to enhanced immunity. Here, we verify this observation.

Using the documented infected case numbers, the SUDR model infers the undocumented infected case numbers in the 11 selected European countries, as shown in Fig. 2. We carry out the inference on the first 2 months from the beginning of the COVID-19 epidemic outbreak in each country for case studies and evaluation. While undocumented infections may exist throughout the whole process of COVID-19 transmission, under-reporting is even more prominent at the early stage of the epidemic outbreak due to the limited number of tests and the lack of preparedness. The specific time period for each country is shown in the third column of Table 1. As shown in Fig. 2, the posterior samples of the undocumented infections converge and the posterior samples of the documented infections fit the observations well. The results show that there are many more undocumented infections than documented ones in this time period (a more in-depth quantitative comparison is given below). Further, the prevalence curves of undocumented infections exhibit a similar trend: they first increase and then decrease in most of the countries where COVID-19 spread rapidly, except Germany and the United Kingdom. This common trend of undocumented infections across countries also reflects the increasing COVID-19 test capacity, the governments’ enforcement of testing, and people’s increased willingness to be tested, which is consistent with real-world scenarios.

Figure 2

The undocumented infections inferred by the SUDR model. We show the density of infected individuals, namely the prevalence of the undocumented and documented infections inferred by SUDR in 11 European countries in the first 2 months of their COVID-19 epidemic. For each country, the ground truth of their reported documented infection case numbers is shown by a dotted black line. The orange lines are 100 random posterior samples of documented infections. The green lines are 100 random posterior samples of undocumented infections inferred by SUDR, with 2.5% and 97.5% percentiles presented by the two solid black lines.

Table 1 A quantitative comparison of the documented and undocumented infections in each country.

In Fig. 2, the fluctuation of the green and orange curves illustrates the different stages of the epidemic contagion in the 2-month period. At the initial stage of the epidemic, most countries had a limited ability to test for the COVID-19 virus. Also, due to the long incubation period and the number of asymptomatic infections, most infected individuals may not have been tested immediately after infection. Hence, at the early stage of outbreaks, there may be a large proportion of undocumented infections, so the green curves significantly exceed the orange ones. Then, with the increase of testing availability and coverage and the enhanced public willingness to be tested, the number of undocumented infections drops gradually. If all the undocumented infections were detected immediately, the curve of the undocumented infections would simply be a horizontal shift of the curve of documented infections, because undocumented infections become documented once detected. Although detection is not immediate in practice, the overall undocumented-to-documented trend shift still holds, explaining why the peak of documented infections always lags behind that of the undocumented ones in each country, as shown in Fig. 2.

Further, the results in Fig. 2 also show the different COVID-19 transformations and evolving states in each country. For instance, COVID-19 transmission was likely under better control at the end of the first 60-day period in Austria, Denmark and Switzerland since they had passed the peaks of both undocumented and documented daily infections. In contrast, the United Kingdom and Germany were still at their early outbreak stages, as their curves, especially the green ones, rise sharply. The rapid increase of undocumented infections in these countries suggests that the number of infections grew rapidly in the absence of effective interventions.

Both undocumented and documented infection case numbers evolve over time. Since the fluctuation of documented infection case numbers lags behind that of the undocumented infection case numbers, it is difficult to compare them without proper time and data alignment. Hence, we only compare their peak values. Table 1 lists the peak value of undocumented infections and the peak value of documented infections for each country. In cases where the curve is still increasing and has not reached its summit, we simply replace the peak value with the maximum value. For documented infections, the observed maximum number of daily active cases in that period is listed in the fourth column, while for undocumented infections, we compute the mean peak value from the samples (the green curves shown in Fig. 2) inferred by the SUDR model. The 95% confidence interval is also reported along with each mean peak value of undocumented infections. The last column shows the ratio \(\max {I^U}/\max {I^D}\), which reflects how large the quantitative gap is between the maximum numbers of undocumented infections and documented ones.
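The peak comparison described above can be sketched as follows. The curves here are synthetic placeholders generated only for illustration (they are not the posterior samples or case data of the study), and using the 2.5% and 97.5% percentiles of the sampled peaks as the interval is an assumption based on the percentile bands shown in Fig. 2.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(60)  # a 60-day window

# Synthetic stand-ins: 100 "posterior sample" curves for undocumented prevalence
# and one "observed" documented prevalence curve that peaks 10 days later.
shape_u = np.exp(-0.5 * ((t - 25) / 8.0) ** 2)
samples_u = 0.004 * shape_u * rng.lognormal(0.0, 0.1, size=(100, 1))
documented = 0.001 * np.exp(-0.5 * ((t - 35) / 8.0) ** 2)

peaks_u = samples_u.max(axis=1)                 # peak (or maximum) of each sampled curve
mean_peak_u = peaks_u.mean()                    # mean peak value of undocumented infections
lo, hi = np.percentile(peaks_u, [2.5, 97.5])    # percentile interval of the peaks
peak_d = documented.max()                       # observed maximum of documented infections
ratio = mean_peak_u / peak_d                    # analogue of max I^U / max I^D

print(f"mean peak U = {mean_peak_u:.5f} [{lo:.5f}, {hi:.5f}], ratio = {ratio:.2f}")
```

Taking the maximum of each curve also covers the case where a curve is still rising at the end of the window, which matches the fallback to the maximum value described above.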

For most countries, the ratio \(\max {I^U}/\max {I^D}\) ranges from around 2 to 6 in the 60-day time period of the first wave of COVID-19. Some existing studies show similar results29. For example, the number of infected in Italy was estimated to be around 3.5 times higher than that reported at the end of February, 2020. However, two outliers are identified in the results: 12.86 (Germany) and 10.88 (the United Kingdom), which are much larger than the average estimated ratio. This is because, in the initial stage, the increase in the number of documented infections lags behind the evolving undocumented infections. When comparing the peak value of undocumented infections and the initial value of documented infections, the ratio becomes larger than the actual value. We notice that the number of active undocumented infections gradually decreases to a low level once the first wave is finally under control.

Overall, Fig. 2 shows that detecting undocumented infections and inferring their relationship with documented infections provide a reasonable inference about the COVID-19 contagion in the first 2 months of COVID-19 outbreaks. Table 1 further shows the quantitative peak values of documented and undocumented infections. The \(\max {I^U}/\max {I^D}\) ratio gives an intuitive measure of the gap between reported and unreported infections. These results may assist in understanding infection movement, forecasting an increase in detected infection cases, and initiating and adjusting the corresponding mitigation policies. In addition, since individual indicators do not paint a complete picture of evolving documented or undocumented cases, readers should cross-reference all indicators to arrive at more comprehensive and trustworthy insights when making intervention policies and choosing the corresponding control measures.

Inferring the epidemic attributes

The main attributes describing the COVID-19 epidemic are the infection rate \(\beta \), the detection rate \(\theta \), and the removal rate \(\gamma \). \(\theta \) refers to the average rate of transition from the undocumented infection state to the documented infection state from a statistical perspective. \(\gamma \) indicates how fast cases are removed statistically (it does not reflect the specific number of days taken for a case removal). The higher the removal rate \(\gamma \), the faster the case number decreases, resulting in faster control of the epidemic. Here, SUDR infers these variables on the reported data from 11 European countries.

First, the infection rate is one of the most important epidemiological attributes to describe the transmission and reproduction features of COVID-19. In existing studies, the infection rate is typically modeled as a constant or time-varying variable. However, as discussed in the introduction, this assumption does not accurately reflect the characteristics and complexities of the COVID-19 transmission processes. Cluster infection is a prominent characteristic of the spread of COVID-19, and the virus transmission routes and circumstances usually involve household, local community and nosocomial infections28,30. Considering this particular epidemiological feature, we model the infection rate as a density-varying (or prevalence-varying) complex function in the SUDR model, which provides a much better capacity to capture the COVID-19 complexities. However, it is difficult to obtain an accurate closed-form solution for the complex prevalence-varying infection rate function, for two reasons: the micro-level transmission mechanism and its functional form are unknown; and the infection rate can only be inferred at discrete points (i.e., the observed prevalence of the reported infection), which are extremely sparse. Hence, we summarize some important statistical characteristics of the sampled infection rates over the undocumented infection densities inferred by our model and present them in the box and whisker plot in Fig. 3.
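As an illustration of what a prevalence-varying infection rate can look like, the sketch below evaluates a function in the Bernstein polynomial basis on the prevalence domain [0, 1]; the robustness experiment later in the text mentions Bernstein polynomials for the \(\beta \) function. The function name `bernstein_beta`, the degree, and the coefficient values are hypothetical, and in the study such quantities would be inferred rather than fixed by hand.

```python
from math import comb

import numpy as np

def bernstein_beta(x, coeffs):
    """Evaluate sum_k c_k * C(n, k) * x^k * (1 - x)^(n - k) for x in [0, 1]."""
    x = np.asarray(x, dtype=float)
    n = len(coeffs) - 1  # polynomial degree
    basis = np.stack([comb(n, k) * x**k * (1.0 - x) ** (n - k) for k in range(n + 1)])
    return np.tensordot(coeffs, basis, axes=1)

# Hypothetical degree-3 coefficients: the rate rises with the undocumented density.
coeffs = [10.0, 18.0, 22.0, 24.0]
densities = np.array([0.0, 0.001, 0.01, 0.05])
print(bernstein_beta(densities, coeffs))
```

One convenient property of this basis is that the first and last coefficients are exactly the function values at densities 0 and 1, so the coefficients can be read as control points of the infection-rate curve.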

Figure 3

The inference of the prevalence-varying infection rate \(\beta (I^U)\) of 11 countries. The complex COVID-19 contagion is modeled using the undocumented infection prevalence and density-varying function, which is inferred by the SUDR model through HMC sampling. This box and whisker plot depicts the significant descriptive statistics of the infection rate, including the median, the minimum, the maximum, the lower quartile, and the upper quartile. The distribution of the sampled infection rates and skewness are visually shown by displaying the quartiles and median. Here, we only overlay the medians (the red bar) for the purpose of conciseness.

The spread of the SARS-CoV-2 virus in the initial stage shows different transmission dynamics with changing infection rates among the 11 European countries. The box plot depicts what the distribution of the infection rate may look like. As shown in Fig. 3, countries like Austria, Germany, Spain and Switzerland have relatively higher average infection rates (23.2, 23.1, 21.9 and 21.0, respectively) compared with France and Sweden (11.3 and 12.9, respectively). Furthermore, the variation range is reflected by the minimum, the lower quartile, the upper quartile, and the maximum. Since the infection prevalence is defined on the domain [0, 1], whereas, as noted in27, the observed densities are usually close to 0 and never reach 1, it can also be inferred that the larger the variation range, the more sensitive the complex contagion function is to the infection density.

Lastly, in addition to the infection rate, SUDR also infers two other epidemiological attributes from the data: the detection rate and the removal rate. As shown in Table 2, the detection rate \(\theta \) indicates the average COVID-19 test ability and test coverage in a country. The higher the detection rate, the faster the undocumented infection cases drop. For instance, the detection rates in four countries, Austria, Denmark, Spain and Switzerland, are much higher than the others. As shown in Fig. 2, the undocumented infection cases in these four countries drop quickly until approaching the level of documented infection cases. We also find that the removal rates \(\gamma \) in these four countries are relatively higher. Considering that most undocumented infections occur in asymptomatic or mildly symptomatic patients who are easier to cure, the number of removed cases per unit time will increase when more undocumented asymptomatic or mild infections are detected.

Table 2 The detection and removal rates inferred by SUDR in each country.

Robustness analysis

As previously mentioned, the reported COVID-19 case data contains various uncertainties and quality issues, including the randomness of case reporting, statistical errors, missing undocumented infection cases, missing reportings, inconsistencies in reporting standards, etc. With such significant uncertainties in the COVID-19 data, as a probabilistic compartmental model, SUDR is more robust and applicable than the existing SIR and its variants. This is because SUDR assumes the parameters follow a certain distribution instead of a fixed constant or function.

Here, we evaluate the SUDR robustness through backtesting validation on the COVID-19 case numbers in the Hubei province, China from Jan 12, 2020 to Mar 23, 2020, collected by JHU CSSE19. We choose this data to validate SUDR robustness because it is extremely demanding. Hubei was the location of the first large-scale outbreak of COVID-19. When the epidemic started to spread, there was limited knowledge about the virus and its containment. The data also involves different confirmation criteria, e.g., the inclusion of suspected cases with a clinical diagnosis among the confirmed cases in Hubei, China on Feb 12th, 2020. In comparison with later reported data, this data is more complex in its case reporting uncertainty, noise and statistics. Comparatively, the aforementioned European data may be less uncertain and noisy since some reporting mistakes were already corrected19. As the Hubei case numbers already contain noise such as statistical errors, missing values, and so on, here we incorporate various degrees of sparsity into the data by randomly masking some of its values, resulting in four sets: the complete data, 5% sparsity, 10% sparsity, and 20% sparsity. In this experiment, the degrees of Bernstein polynomials of the \(\beta \) function, the deviation hyper-parameters, and the HMC parameters of SUDR are the same as in the above experiment.
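The random masking used to create the sparsity levels can be sketched as below. This is an assumed procedure for illustration: the helper `mask_series`, the use of NaN to mark a masked value, the fixed seed, and the placeholder series are all hypothetical, since the text does not specify how the masking was implemented.

```python
import numpy as np

def mask_series(values, sparsity, seed=0):
    """Return a copy of `values` with a random fraction `sparsity` set to NaN."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n_mask = int(round(sparsity * values.size))
    idx = rng.choice(values.size, size=n_mask, replace=False)  # distinct positions
    masked = values.copy()
    masked[idx] = np.nan
    return masked

# Placeholder daily series (not the Hubei data), masked at the four levels in the text.
series = np.linspace(0.0, 1.0, 100)
datasets = {level: mask_series(series, level) for level in (0.0, 0.05, 0.10, 0.20)}
for level, data in datasets.items():
    print(f"{level:.0%} sparsity: {int(np.isnan(data).sum())} masked values")
```

Sampling positions without replacement guarantees that the realized sparsity equals the requested level exactly, rather than only in expectation.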

Three baselines are chosen for the robustness comparison. First, SIR is a classic compartmental model with fundamental biological insight. Second, time-dependent SIR23 is an SIR with time-dependent functions to model the transmission rate and removal rate and applies the ridge regression for the model solution. Lastly, complex SIR27 is a probabilistic extension of SIR by replacing the constant transmission rate with a density-dependent function that relies on the infection case numbers. These baselines only model the explicitly documented infections as they cannot detect undocumented infections. For the sake of fairness, the comparison experiments only test how well these models fit the reported cases under complex data conditions. The settings of the time-dependent SIR and complex SIR models are the same as in their original designs for optimal performance.

In the backtesting, according to the known case numbers (including the population, the documented infection numbers, and the recovered and death case numbers), we infer the infection rate \(\beta \) and the removal rate \(\gamma \) using these models. Then, with the initial values, the case number series can be obtained step by step using the ODE functions of the models. The robustness and effectiveness of the models can be estimated by how well the computed case number series fit the observed daily cases in the data under different noise conditions.
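The backtesting loop described above (infer the rates from known case numbers, roll the model forward from the initial values, then measure the fit) can be sketched for the classic SIR baseline as follows. This is an assumption-laden illustration, not the inference used by any of the compared models: the least-squares estimator, the daily Euler update, the RMSE metric, and the synthetic "observations" are all chosen here for simplicity.

```python
import numpy as np

def roll_forward(s0, i0, r0, beta, gamma, days):
    """Step the discrete SIR update forward from the initial values."""
    out = np.empty((days + 1, 3))
    out[0] = (s0, i0, r0)
    for t in range(days):
        s, i, r = out[t]
        new_inf, new_rem = beta * s * i, gamma * i
        out[t + 1] = (s - new_inf, i + new_inf - new_rem, r + new_rem)
    return out

def fit_constant_rates(obs):
    """Least-squares beta and gamma from the daily changes in S and R."""
    s, i, r = obs[:, 0], obs[:, 1], obs[:, 2]
    x_beta, y_beta = s[:-1] * i[:-1], -np.diff(s)  # -dS = beta * S * I
    x_gamma, y_gamma = i[:-1], np.diff(r)          # dR = gamma * I
    beta = float(x_beta @ y_beta / (x_beta @ x_beta))
    gamma = float(x_gamma @ y_gamma / (x_gamma @ x_gamma))
    return beta, gamma

def rmse(pred, obs):
    """Root-mean-square error over the observed (non-NaN) entries only."""
    keep = ~np.isnan(obs)
    return float(np.sqrt(np.mean((pred[keep] - obs[keep]) ** 2)))

# Synthetic observations generated with known rates, so the backtest can be checked.
observed = roll_forward(0.999, 0.001, 0.0, beta=0.35, gamma=0.12, days=70)
beta_hat, gamma_hat = fit_constant_rates(observed)
refit = roll_forward(*observed[0], beta_hat, gamma_hat, days=70)
error = rmse(refit[:, 1], observed[:, 1])
print(f"beta = {beta_hat:.3f}, gamma = {gamma_hat:.3f}, RMSE = {error:.2e}")
```

On noise-free synthetic data the rates are recovered essentially exactly; the interesting comparison in the text is how this fit degrades once the observations are noisy or partially masked.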

As shown in Fig. 4, SUDR and complex SIR achieve a similar performance. SUDR performs better in the first half of the time period (before day 30), while the complex SIR performs better in the second half (after day 50). This suggests that SUDR pays more attention to the data before day 30 in inferring the epidemiological parameters, while the complex SIR does the opposite. However, both models perform better than the time-dependent SIR and classic SIR at different levels of sparsity. With the increase of sparsity, the performance of SUDR and complex SIR drops gradually, but they still outperform the others. The classic SIR model (the blue curve) shows quite a different trend from the real observation data, indicating significant inaccuracy in the inferred transmission rate and removal rate. It is clearly unreliable to infer the trend of COVID-19 merely from constant mean values of the transmission rate and removal rate. The time-dependent SIR model performs better than the classic SIR model as it captures some changes in the observations and is only marginally affected by the sparsity level. However, the time-dependent SIR is fragile to noise. As noted in23, the Hubei data involves more confirmed cases due to the relaxed case confirmation criteria since Feb 12, 2020. This adjustment leads to a jump in infectious cases around the 32nd day, as shown in Fig. 4. After this cutoff point, the time-dependent SIR does not fit the actual infectious case numbers, especially in the second half of the period. In summary, the probabilistic compartmental models, namely SUDR and complex SIR, are robust to the noise and sparsity in the reported data.

Figure 4

The performance comparison on the COVID-19 data with different levels of sparsity. The black dots refer to the density of daily infection cases. The colored lines show the density of daily infection cases inferred by the four models: SIR, time-dependent SIR, complex SIR, and SUDR. With the backtesting validation, the visualization shows the robustness and effectiveness of these models, indicating how well the prevalence estimated by these models fits the actual observations at different levels of sparsity.

The comparison results in Fig. 4 provide some general insights. On one hand, the compared models represent three typical directions of epidemic modeling, each addressing an important concern in understanding the COVID-19 complexities: the epidemiological compartments (classic compartmental models, e.g., SIR), the time dependency of case numbers (time-dependent compartmental models, e.g., time-dependent SIR), and the uncertainty of case reporting (probabilistic compartmental models, e.g., complex SIR and SUDR). On the other hand, the complex conditions of COVID-19 data must be captured in COVID-19 modeling, including missing values, statistical errors, rectification, and sparsity. In addition, the results show that probabilistic compartmental models like SUDR outperform the classic compartmental models and time-dependent compartmental models.

Discussion

Accurately inferring the undocumented infection case numbers of COVID-19 is one of the most challenging tasks in modeling COVID-19, and it is even more difficult for data collected in the very early stage of the pandemic. The challenge comes from various uncertainties related not only to the COVID-19 epidemic itself, represented by the sophisticated epidemiological attributes of the coronavirus, but also to other diverse data uncertainties. In particular, there is a high proportion of asymptomatic and mildly symptomatic infections that pose a high contagion threat to the susceptible, together with strong inconsistencies in case reporting methods, timing, and confirmations1. The public data for the early stage suffers from various data quality issues, including noise, inconsistencies and errors. These issues are still apparent in the current COVID-19 resurgence, mainly caused by coronavirus mutations (such as the Delta, Lambda and Omicron variants), and in vaccine breakthrough infections.

This study proposes an inference approach from the macro-level perspective for this complex social-tech problem. The actual underlying interactions between entities in the process of COVID-19 transmission are not truly known. Accordingly, a density-dependent infection function captures complex contagion dynamics, including social reinforcement and non-monotonic relations between the expected epidemic size and the average transmission rate, better than other typical methods that model a constant or time-dependent infection rate.

In contrast to the complex contagion function, we adopt a concise and plain four-compartment SIR-like model to characterize the COVID-19 transmission processes. The proposed SUDR shows a stronger generalization ability than the elaborate compartmental models, which may include seven or more states. Given the lack of knowledge about the underlying contagion interactions and spread patterns, it is appropriate to design a generalized model that can avoid serious deviations and mismodelling errors in characterizing the actual contagion mechanisms.

The second observation from this work is that probabilistic compartmental models are a good choice to characterize complex data conditions in COVID-19 reporting. With Bayesian frameworks, probabilistic compartmental models outperform other mathematical epidemic models by assuming the central epidemiological parameters follow certain distributions. This naturally captures the uncertainty in both the COVID-19 processes and case data, which is superior to typical constant models (e.g., the classic compartmental models SIR and SEIR) and time-varying function models (e.g., time-dependent compartmental models). Probabilistic compartmental models also offer better robustness and interpretability than classic compartmental models and time-dependent compartmental models.

However, our work and similar probabilistic compartmental modeling can be further enhanced in various ways. First, it is difficult to obtain an accurate infection function due to the extreme sparsity of the prevalence and the sampling method. The relationship between the infection rate variation and the undocumented infection density is still not revealed by the current model. Second, SUDR assumes the clusters are isomorphic and homogeneous. In fact, the population stratification and the interaction structure within a cluster may influence the COVID-19 contagion, requiring further study. Lastly, probabilistic compartmental models strongly depend on prior knowledge of distributions and hyperparameters, which, however, is difficult to obtain. In addition, other factors may also be considered if such data is available: the number of tests, the methods and coverage of testing, the infectious period, and the delay in documenting each case.

Going beyond modeling social reinforcement on infections, there are many other complex factors and interactions in the COVID-19 problem space. These include virus mutations, vaccination rate and efficacy, nonpharmaceutical interventions, external factors such as weather and mobility, and their joint influence on COVID-19 infection, transmission and containment. These factors interact and jointly affect the evolution of the COVID-19 pandemic and the endemic in a region, together with other internal and external factors. An increasing amount of specific research has been reported on each of these aspects; however, only limited research is available on jointly modeling these interactions and influences31. A future topic relevant to this work is to explore probabilistic compartmental modeling of the interactions and influence of such factors.

Methods

Data

We evaluate the SUDR model in detecting undocumented infections under imperfect conditions, i.e., the reporting noise and under-reported numbers in the publicly available data. We test the model on real-world 60-day COVID-19 data from 11 European countries32, a subset of the global COVID-19 case dataset reported by JHU CSSE19. The data records the worldwide daily case numbers, including confirmed case numbers, recovered case numbers, and death case numbers. The data is publicly available, and we confirm that this case study conforms to our university’s research ethics and that all experiments were performed in accordance with relevant guidelines and regulations.

Here, we only extract the initial period (i.e., the first 60 days) of the COVID-19 outbreak in these countries. This early stage is more likely to contain undocumented cases, and it is more challenging to model and control the epidemic dynamics. In general, the first waves and the resurgences driven by new COVID-19 variants are often more challenging to model and to design interventions for31. The challenges usually come from a limited number of COVID-19 tests, poor test coverage, and poor knowledge and awareness of COVID-19 complexities, including transmissions, incubation periods, mutated attributes, and the difference from the original strains. At this stage, many cases may only be documented after obvious symptoms appear and sufficient test toolkits are available. This incurs a larger proportion of undocumented infections.

Modeling COVID-19 transmission mechanisms

SUDR is a compartmental epidemic model embedded with Bayesian statistical methods. It jointly models the COVID-19 epidemic processes, asymptomatic infections, social reinforcement of contagion, and imperfect data conditions.

Figure 5

Compartment representation of the SUDR model. SUDR sequentially characterizes the COVID-19 transmission mechanisms as follows: (1) the susceptible (S) transit to the undocumented infected (\(I^U\)) once they are infected at the infection rate \(\beta \); (2) the undocumented infected are either detected at the detection rate \(\theta \) and become documented (\(I^D\)) or removed directly; and (3) the documented infected are removed. We assume both the undocumented and documented infected become recovered or deceased (R) at the same removal rate \(\gamma \).

Figure 5 illustrates the SUDR model for the epidemiological compartmental characterization of COVID-19. SUDR comprises four compartments \((S, I^U, I^D, R)\) to simulate the entire transmissions with asymptomatic infections and the transfer from an undocumented to a documented state. Accordingly, the COVID-19 transmission and dynamics are formulated per Eqs. (1)–(4) over time steps \(t=1,2,\ldots ,T\) (corresponding to each day in daily case reporting).

$$\begin{aligned} \frac{dS(t)}{dt}= & {} -\beta (I^U(t)) S(t) I^U(t)/{P} \end{aligned}$$
(1)
$$\begin{aligned} \frac{dI^U(t)}{dt}= & {} \beta (I^U(t)) S(t) I^U(t)/{P} - \theta I^U(t) - \gamma I^U(t) \end{aligned}$$
(2)
$$\begin{aligned} \frac{dI^D(t)}{dt}= & {} \theta I^U(t) - \gamma I^D(t) \end{aligned}$$
(3)
$$\begin{aligned} \frac{dR(t)}{dt}= & {} \gamma (I^U(t) + I^D(t)) \end{aligned}$$
(4)

S refers to the number of susceptible individuals who are not epidemically contained and thus may be exposed to the virus at the infection rate (function \(\beta \)). When infected, a susceptible transits to the undocumented infectious compartment (Eq. 1). P refers to the subpopulation involved in the epidemic, which is assumed to be a part of the entire population W (this is particularly applicable to the first COVID-19 waves and new resurgence after full zero-infection containment). As superspreading events (SSEs) and cluster infection are common in the COVID-19 pandemic33,34, not all people in W are susceptible, particularly when they geographically stay far away from the epicenter or adopt effective self-protection measures (e.g., wearing face masks or staying at home). In other words, SUDR does not involve such individuals in the epidemic transmission processes to be modeled. Accordingly, we assume only \(\alpha \in [0, 1]\) of the entire population W is involved in the active epidemic shown in Fig. 5, i.e., \(P = \alpha W\).

\(I^U\) is the number of undocumented individuals contracting the virus, who can thus infect those susceptible individuals such as close contacts or household infections. They are undocumented as they may be either in an incubation period or asymptomatic. This undocumented group forms an important determinant of the pathogen’s pandemic potential, as these infections are likely undiagnosed but highly contagious15. Those undocumented infectious individuals, once confirmed with the virus infection (e.g., by diagnosis test) at detection rate \(\theta \), transit to the documented infectious compartment \(I^D\) (Eq. 2), who are then quarantined and will rarely further infect other susceptible individuals. We assume those observed cases fall in this group. People in \(I^D\) will then either be cured or unfortunately die, and then directly transit to the removed compartment R at the removal rate \(\gamma \) (see Eq. 3). Both \(I^U\) and \(I^D\) are time-dependent over time t. R combines both recovered and deceased individuals who are converted from the undocumented and documented infectious compartments (see Eq. 4). We further assume the recovered and dead individuals are immune against the virus, i.e., they will not further infect other people.
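The dynamics in Eqs. (1)–(4) can be illustrated with a minimal numerical sketch. It assumes a forward-Euler scheme with several sub-steps per day and assumes that the infection rate function takes the density \(I^U/P\) as its argument; the function name `simulate_sudr`, the step size and the toy parameter values are illustrative choices, not the original implementation.

```python
import numpy as np

def simulate_sudr(beta_fn, theta, gamma, y0, P, T, steps_per_day=10):
    """Forward-Euler integration of Eqs. (1)-(4).

    beta_fn : callable mapping the undocumented density I^U / P to a rate
              (assumption: beta is evaluated on the density, not the count)
    y0      : initial counts (S_0, I^U_0, I^D_0, R_0)
    Returns an array of shape (T + 1, 4) with daily (S, I^U, I^D, R).
    """
    S, IU, ID, R = (float(v) for v in y0)
    dt = 1.0 / steps_per_day
    daily = [(S, IU, ID, R)]
    for _ in range(T):
        for _ in range(steps_per_day):
            new_inf = beta_fn(IU / P) * S * IU / P  # flow S -> I^U
            detected = theta * IU                   # flow I^U -> I^D
            rem_u = gamma * IU                      # flow I^U -> R
            rem_d = gamma * ID                      # flow I^D -> R
            S += dt * (-new_inf)                    # Eq. (1)
            IU += dt * (new_inf - detected - rem_u) # Eq. (2)
            ID += dt * (detected - rem_d)           # Eq. (3)
            R += dt * (rem_u + rem_d)               # Eq. (4)
        daily.append((S, IU, ID, R))
    return np.array(daily)

# Toy run with a constant infection rate (hypothetical values).
traj = simulate_sudr(lambda density: 0.5, theta=0.1, gamma=0.1,
                     y0=(9990.0, 10.0, 0.0, 0.0), P=10000.0, T=60)
```

Because every outflow of one compartment is the inflow of another, the total \(S + I^U + I^D + R\) stays constant along the trajectory, which gives a quick sanity check on any implementation.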

Modeling the asymptomatic infections

As illustrated in Fig. 1, COVID-19 infectious individuals may infect the susceptible during their incubation periods or when they are asymptomatic. However, both scenarios are undetectable. In addition, it is shown that a large proportion of asymptomatic infections cannot be detected immediately. These asymptomatic infections are a great challenge to sourcing and containing infections before the onset of symptoms and before they infect other people, leading to a significant time delay in treating the infected and mitigating their contagion spread. To address the incubative and asymptomatic infections, we partition the infectious population I into undocumented \(I^{U}\) and documented \(I^{D}\) infectious individuals. Those undocumented cases could be in incubation or asymptomatic, and we assume all COVID-19 infections are initially undocumented. However, those who develop symptoms and are diagnosed will be detected, transferring to the documented compartment \(I^{D}\) at detection rate \(\theta \).

We further assume that only undocumented infectious individuals are infectious to the susceptible since those who are detected are likely quarantined and are unlikely to further infect the susceptible without close contact. The undocumented infections may have a much higher probability than the documented to interact with the susceptible when they have minimal symptoms or are unaware of infection. This assumption is consistent with reality especially at the early stage of the COVID-19 outbreak, when both viral testing and effective protection are limited.

Modeling the contagion reinforcement

The contagion of COVID-19 may be reinforced during unsafe social interactions, as COVID-19 can be regarded as a complex socially reinforced contagion network. When a susceptible individual is infected, their close contacts may have a higher probability of being infected. The infections of close contacts will further be passed to their contacts. Consequently, the population infection probability increases nonlinearly with the density of infected neighbours in a chained way. This explains the commonly seen cluster infections, such as through local communities like households, parties and hospitals, which dominate the spread of COVID-19.

SUDR thus models this COVID-19 contagion reinforcement, which may be caused by various contagious factors. We model the transmission rate as the function of the density of the infected population, inspired by27. Compared with assuming a time-dependent transmission rate in the epidemic modeling, a density-dependent transmission rate function can more reasonably characterize the social reinforcement of COVID-19 contagion and provide a better interpretability of dominating cluster infections.

Modeling data uncertainty, sparsity and noise

To model the aforementioned COVID-19 data quality issues including noise, sparsity and randomness, we incorporate Bayesian inference into SUDR, making it capable of modeling these data conditions. For this, we refer to the density of documented infections at time t as the COVID-19 prevalence \(Y_t\) for the measurement. \(Y_t \in [0, 1]\) and is typically very close to 0 due to the large population size. By assuming that the population is well mixed, the likelihood of the prevalence \(Y_{1:T}\) can be obtained as:

$$\begin{aligned} & P(Y_{1:T}|\beta ,\theta ,\sigma ,\gamma ,y_0) \\&\quad= \int _{\mathscr{Y}^T}P(Y_{1:T}|y_{1:T}, \sigma )P(y_{1:T}|\beta ,\theta ,\gamma ,y_0)d y_{1:T} \end{aligned}$$
(5)

\(y_t \in \mathscr {Y}\) corresponds to the state set of susceptible, undocumented infectious, documented infectious, and removed people at time t. \(y_0 = (S_0, I^U_0, I^D_0, R_0)\) corresponds to the initial state. The noise component is shown in Eq. (6), which is a normal distribution with mean \(I[y_t]\) (referring to the density of the infectious individuals in the state set at time t) and standard deviation \(\sigma \).

$$\begin{aligned} P(Y_{1:T}|y_{1:T}, \sigma ) = \prod _t q(Y_t | I[y_t], \sigma ) \end{aligned}$$
(6)

Since there is not a closed-form solution for Eq. (5), we take a mean-field approximation method for the inference. Similar to the inference in27, we only consider the largest contribution in Eq. (5), leading to

$$\begin{aligned} P(Y_{1:T}|\beta ,\theta ,\gamma ,\sigma ,y_0) = \prod _{t \in T} q(Y_t|\tilde{y}_t(\beta ,\theta ,\gamma ;y_0),\sigma ) \end{aligned}$$
(7)

where \(\tilde{y}_{1:T}(\beta ,\theta ,\gamma ;y_0)\) is the time series of the density of infectious individuals, computed from Eqs. (1)–(4) given the initial condition \(y_0\).
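As an illustration of Eq. (7), the following sketch evaluates the log of the product of normal densities, taking q as a normal density with mean \(\tilde{y}_t\) and standard deviation \(\sigma \) as described for Eq. (6). The function name `log_likelihood` and the toy inputs are hypothetical; the trajectory \(\tilde{y}_{1:T}\) is assumed to have been computed beforehand from Eqs. (1)–(4).

```python
import numpy as np

def log_likelihood(Y, y_tilde, sigma):
    """Log of Eq. (7): sum over t of log Normal(Y_t | y_tilde_t, sigma).

    Y       : observed prevalence series Y_{1:T}
    y_tilde : mean-field prevalence series computed from the ODEs
    sigma   : standard deviation of the observation noise
    """
    Y = np.asarray(Y, dtype=float)
    y_tilde = np.asarray(y_tilde, dtype=float)
    resid = Y - y_tilde
    return float(-0.5 * np.sum((resid / sigma) ** 2)
                 - Y.size * np.log(sigma * np.sqrt(2.0 * np.pi)))

# Toy check: a perfect fit scores higher than a shifted trajectory.
Y_obs = [0.1, 0.2, 0.3]
ll_perfect = log_likelihood(Y_obs, [0.1, 0.2, 0.3], sigma=1.0)
ll_shifted = log_likelihood(Y_obs, [0.2, 0.3, 0.4], sigma=1.0)
```

Working with the log-likelihood rather than the raw product avoids numerical underflow when T is large.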

With the prevalence likelihood, we further obtain the posterior distribution of the model parameters given the prevalence data \(Y_{1:T}\):

$$\begin{aligned} P(\beta ,\theta ,\gamma ,\sigma ,y_0|Y_{1:T}) = \frac{P(Y_{1:T}|\beta ,\theta ,\gamma ,\sigma ,y_0)P(\beta ,\theta ,\gamma ,\sigma ,y_0)}{P(Y_{1:T})} \end{aligned}$$
(8)

Before sampling, we assume the priors for \(\beta \), \(\theta \), \(\gamma \) and \(\sigma \) in the likelihood. We first parameterize the infection rate function \(\beta \) since we cannot directly place priors for functions. Bernstein polynomials are adopted for the parameterization as shown in Eq. (9), where N is the degree of Bernstein polynomial for \(\beta \) with coefficients \(\xi _{0:N}\).

$$\begin{aligned} \beta (I^U) = B_N (I^U; \xi _{0:N}) \end{aligned}$$
(9)
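A short sketch of Eq. (9) is given here, assuming the standard Bernstein basis \(B_N(x; \xi _{0:N}) = \sum _{i=0}^{N} \xi _i \binom{N}{i} x^i (1-x)^{N-i}\) on \(x \in [0, 1]\). The function name `bernstein` and the coefficient values are illustrative only.

```python
from math import comb

def bernstein(x, xi):
    """Evaluate a degree-N Bernstein polynomial with coefficients xi_0..xi_N.

    x  : density of undocumented infections, expected in [0, 1]
    xi : sequence of N + 1 coefficients
    """
    n = len(xi) - 1
    return sum(c * comb(n, i) * x**i * (1 - x)**(n - i)
               for i, c in enumerate(xi))

# Degree N = 8 gives nine coefficients (hypothetical values).
xi = [0.2, 0.4, 0.9, 1.5, 1.2, 0.8, 0.5, 0.3, 0.1]
beta_at_low_density = bernstein(0.0, xi)
beta_at_mid_density = bernstein(0.5, xi)
```

With non-negative coefficients the resulting \(\beta \) is non-negative on the whole interval, and the value at the endpoints equals the first and last coefficient, which makes the coefficients easy to interpret.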
Figure 6

The graphical model of the inference for SUDR. The grey node \(Y_t\) refers to the observations, and the white nodes are variables to be inferred from the observations. The box shows the dimension of the variables. The prior information is characterized by the hyperparameters (black dots).

The SUDR model summary

In summary, we have the SUDR model to infer the COVID-19 prevalence \(Y_{1:T}\) for time steps \(t = 1, \ldots , T\) and coefficient indices \(i = 0, \ldots , N\) as follows:

$$\begin{aligned} \begin{aligned} Y_t&\sim \text {Normal}(\tilde{y}_t, \sigma ^2) \\ \sigma ^2&\sim \text {Half-Cauchy}(0, a) \\ \tilde{y}_{1:T}&= \text {SUDR}(\beta (\xi _{0:N}), \theta , \gamma , y_0(S_0, I^U_0, I^D_0)) \\ \xi _i&= \mu _{\xi } + \delta _i \\ \mu _{\xi }&\sim \text {Half-Normal}(0, b) \\ \delta _i&\sim \text {Half-Normal}(0, c) \\ \theta&\sim \text {Half-Cauchy}(\mu _{\theta }, d) \\ \gamma&\sim \text {Half-Cauchy}(\mu _{\gamma }, e) \end{aligned} \end{aligned}$$
(10)

where \(\text {SUDR}(\beta (\xi _{0:N}), \theta , \gamma , y_0(S_0, I^U_0, I^D_0))\) returns a mean-field time series of prevalence with a contagion function \(\beta \) parametrized by the degree-N Bernstein polynomial with coefficients \(\xi _{0:N}\), detection rate \(\theta \), removal rate \(\gamma \), and initial conditions \(y_0=(S_0, I^U_0, I^D_0)\). Since there is no information about the initial cases, we assume the initial conditions \(S_0, I^U_0, I^D_0\) follow the distributions:

$$\begin{aligned} \begin{aligned} S_0&\sim \text {Half-Normal}(\mu _{s_0}, f) \\ I^U_0&\sim \text {Half-Normal}(\mu _{I^U_0}, g) \\ I^D_0&\sim \text {Half-Normal}(\mu _{I^D_0}, h) \end{aligned} \end{aligned}$$
(11)
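To make the prior structure in Eq. (10) concrete, the sketch below draws one sample of the parameters. It assumes zero location for all half distributions, so that a half-normal or half-Cauchy draw is the absolute value of a zero-centred draw, and it uses the hyperparameter values reported under Model implementation as defaults. The initial conditions in Eq. (11) are omitted because the handling of the nonzero location \(\mu _{s_0}\) is not specified here. The function name `draw_priors` is hypothetical.

```python
import numpy as np

def draw_priors(N=8, a=1.0, b=10.0, c=5.0, d=1.0, e=10.0, seed=0):
    """Draw one sample from the priors of Eq. (10), assuming zero locations."""
    rng = np.random.default_rng(seed)

    def half_normal(scale, size=None):
        # |Normal(0, scale)| is a half-normal draw with location 0.
        return np.abs(rng.normal(0.0, scale, size))

    def half_cauchy(scale):
        # |Cauchy(0, scale)| is a half-Cauchy draw with location 0.
        return float(np.abs(rng.standard_cauchy()) * scale)

    mu_xi = float(half_normal(b))
    delta = half_normal(c, N + 1)      # one offset per coefficient xi_0..xi_N
    return {
        "mu_xi": mu_xi,
        "xi": mu_xi + delta,           # xi_i = mu_xi + delta_i
        "theta": half_cauchy(d),
        "gamma": half_cauchy(e),
        "sigma2": half_cauchy(a),
    }

sample = draw_priors()
```

Since both \(\mu _{\xi }\) and \(\delta _i\) are non-negative, every coefficient \(\xi _i\) is non-negative and at least as large as the shared mean \(\mu _{\xi }\).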

Figure 6 further shows the probabilistic graphical model of SUDR, where the grey circle refers to the observed data, namely the reported infections; and the white circles stand for the variables to be inferred by the model. The hyperparameter is represented by the black dot, and the capital letter in the box indicates the number of the variables contained in the box. The probabilistic graphical model clearly demonstrates the dependency relationship between the variables.

Model implementation

SUDR is implemented in the Stan probabilistic programming language for statistical inference35. The Hamiltonian Monte Carlo (HMC) algorithm is adopted to generate samples from the posterior distribution in Eq. (8). The observed daily infectious case numbers are divided by the corresponding population of each country to obtain the density (the prevalence). For the sake of simplicity, we set \(\alpha \) to a constant value of 0.01 in our experiments, indicating that 1% of the whole population in the country is involved in the epidemic transmission process. We set \(N=8\) for the degree of the Bernstein polynomial of the \(\beta \) function since a low-degree Bernstein polynomial performs well enough for the inference. For the deviation hyperparameters in Eqs. (10) and (11), we set \(b=10\), \(c=5\), \(e=10\), \(a=d=f=g=h=1\), \(\mu _{\theta }=\mu _{\gamma }=\mu _{I_0^U}=\mu _{I_0^D}=0\), and \(\mu _{s_0}=0.01\). For the HMC algorithm, the default four chains are adopted for sampling. Other sampling parameters, such as the iteration number and control parameters, are adjusted for each country until convergence.
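The data preparation described above amounts to two divisions and one multiplication, sketched here. The function names `prevalence` and `involved_population`, and the population and case values, are hypothetical; only \(\alpha = 0.01\) and the division of case numbers by the country population come from the text.

```python
import numpy as np

ALPHA = 0.01  # fraction of the whole population W involved in the epidemic

def prevalence(infectious_cases, population):
    """Convert daily infectious case counts into densities Y_t in [0, 1]."""
    return np.asarray(infectious_cases, dtype=float) / float(population)

def involved_population(W, alpha=ALPHA):
    """Subpopulation P = alpha * W that takes part in the transmission process."""
    return alpha * W

# Toy example for a country of 60 million people (hypothetical numbers).
W = 60_000_000
Y = prevalence([600, 1200, 3000], W)
P = involved_population(W)
```

The resulting densities are very small relative to 1, which is consistent with the remark that \(Y_t\) lies close to 0 for large populations.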