Introduction

An outbreak of severe pneumonia, an infectious disease now named COVID-19, was reported in China in December 2019. The aetiological agent, SARS-CoV-2, a novel coronavirus, was isolated by Chinese authorities on 7th January 2020 and reported by WHO on 9th January 2020. The first case of COVID-19 was reported to have symptom onset on 1st December 2019 in Wuhan city of Hubei province, China1, after which the virus quickly spread to other parts of China2,3 and the first case outside China was reported on 13th January 2020 (see Table 1). Because of the rapid spread of the disease, WHO announced the outbreak of COVID-19 as a “public health emergency of international concern” on 30th January 2020 and assessed COVID-19 as a pandemic on 11th March 20204. By 25th March 2020, 405,742 cases and 18,791 fatalities associated with COVID-19 had been reported globally, of which, 81,218 cases and 3281 fatalities were reported in mainland China. By 10th April 2020 (25 days later), whilst the corresponding numbers in China had remained relatively unchanged (81,953 cases and 3339 deaths), the global numbers increased dramatically to 1,521,252 cases and 92,798 deaths. The comparatively small increase in the numbers of cases and deaths in mainland China can be attributed to draconian and rapid control measures implemented since late January 2020, starting from the lockdown of the epicentre Wuhan city on 23rd January 2020 and then extending to all of mainland China3. These included the complete shutdown and isolation of whole cities, cancellation of Chinese New Year celebrations, and prohibition of school and work attendance, frequent multi-media broadcasts of critical information (e.g., promoting hand washing, mask wearing, and care seeking), massive mobilization of health and public health personnel and military medical units, rapid construction of hospitals for patients with severe symptoms, and reconstruction of shelters for patients with no or mild symptoms. Control measures included travel bans and restrictions, contact reductions and social distancing, screening and contact tracing, early case identification and isolation2,5 to curb the epidemic.

Table 1 Timeline of outbreak in mainland China.

As a novel coronavirus, it is important to understand its transmissibility and severity and to judge how effectively control measures in China have helped stop the spread of SARS-CoV-2 virus which may help inform control of COVID-19 outbreaks in the world and its future potential re-emergence. In previous analyses of transmission dynamics in China, Kucharski et al.6 fitted a stochastic transmission dynamic model to four datasets of cases from Wuhan and just considered the period up to 11th February 2020. They estimated that the median daily reproduction number (Rt) in Wuhan declined from 2.35 (95% CI 1.15–4.77) 1 week before travel restrictions were introduced on 23rd January 2020, to 1.05 (95% CI 0.41–2.39) 1 week after. Other modelling studies2,3,7,8 also used the confirmed case data from the early pandemic stage to model and estimate the spread, dynamics and control of the COVID-19 outbreak in China. As the number of new confirmed case in mainland China reached a peak only after 17th February 2020 and other relevant data such as those relating to fatality and hospital discharges are becoming available, a further study including more data streams is needed to show the overall effectiveness of the integrated control measures executed in the whole nation.

Based on the published data associated with the confirmed cases, death and recovery, in this study we use a Synthesis model (Fig. 1) which covers both the transmission dynamics of the SARS-CoV-2 virus and the disease reporting process of COVID-19 to estimate the transmissibility of SARS-CoV-2 and the severity of COVID-19 in China and the impact on the spread of COVID-19 of control measures the Chinese government has taken from late January 2020.

Figure 1
figure 1

Flow chart of the synthesis model: transmission dynamics and disease reporting processes. The five rectangular boxes represent the hidden transmission dynamics process with U representing the number of people who have recovered from an undetected infection and the four blue shaded polygons represent the quantities upon which observations were made. The quantities above the arrows represents the rate at which people go from one compartment to the next and β = \(\beta \left( t \right)\left[ {I_{1} \left( t \right)\left( {\theta_{1} +\xi \left( {1 -\theta_{1} } \right)} \right) + \xi I_{u} \left( t \right)} \right]\).

Results

The estimates of the model parameters are shown in Table 2. The proportion of COVID-19 cases in mainland China that were detected and reported was about 3.5% (95% confidence interval (CI): 1.7–8.3%) before 19th January 2020, and then increased to 36.6% (95% CI 26.1–55.4%) afterwards. At the early stage before 8th February 2020, the transmissibility of COVID-19 was high with R0 = 2.33 (95% CI 1.96–3.69); however, it reduced dramatically to 0.04 (95% CI 0.01–0.10) from 8th February 2020 (95% CI 7th–9th February 2020). By fixing the duration from onset of symptoms to death and recovery at 17.8 days and 22.6 days respectively9, the average infectious period for those that are symptomatic and confirmed and then quarantined is 2.28 days (95% CI 2.01–3.12 days) and the average infectious period for undetected infections is 4.44 days (95% CI 3.91–11.71 days).

Table 2 Estimates of model parameters for the COVID-19 outbreak within mainland China.

In our model, the basic reproduction number is contributed by two parts: the confirmed cases and undetected infections. Confirmed cases and undetected infections are assumed to be equally infectious in this study; combining with the estimate that only about 3.5% of infections were confirmed and reported at the early stage, this suggests that at the early stage, about 98% of R0 (Table 2) is due to the undetected infections. If the infectiousness of undetected infections is only half or one third of that of confirmed cases7, R0 in the absence of interventions remains nearly the same, and about 96% of R0 is due to the undetected infections (Supplementary Table S2.1). This indicates that only isolating confirmed cases and their contacts is not enough to stop the spread and that the main factor that stopped the COVID-19 outbreak in mainland China was the dramatic drop in contact rates among the general population (c.f.7).

The model fitting to the daily number of hospitalizations (confirmed cases), the daily number of deaths, and daily number of recovered people are shown in Fig. 2. Our model analysis suggests that the death rate could have changed around 15 February 2020 (95% CI 14th–16th February 2020). Before this date the confirmed case fatality rate (cCFR) is 9.6% (95% CI 8.1–11.4%), after this date it reduces to 0.7% (95% CI 0.4–1.0%). Based on the estimates of the case ascertainment rate and case fatality rate, the infection fatality rate (IFR) is 0.33% (95% CI 0.17–0.85%) before 19th January 2020 and increases to 3.51%(2.60–5.12%) during the period from 20th January to 15th February 2020; after this, it decreases to 0.24% (95% CI 0.15–0.41%). Our model inference suggests that on 1st December 2019, there were 41 (95% CI 13–207) people infected with SARS-CoV-2 virus. These estimates are based on the assumption that the 1290 deaths added on 17th April 2020 were distributed before 20th February 2020. If we assume that the 1290 deaths were distributed before 17th April 2020, the estimates of most model parameters remain nearly the same except for the death-related quantities (Table 2). For example, after 15 February 2020 cCFR changes from 0.67 to 1.28%, and the IFR from 0.24 to 0.50%.

Figure 2
figure 2

Epidemic curves from 1st December 2019 to 21st April 2020. Model predictions of infections (panel a) and model fitting to the daily numbers of confirmed cases (panel b), deaths (panel c) and recovered cases (panel d) are shown. The model predictions are obtained after redistributing the 1290 deaths added to the official data on 17th April 2020 to the period before 20th February 2020 in accordance with the daily number of deaths reported before 17th April 2020. The green dotted lines represent the model predictions (large green circles for median and thin green dotted lines for lower and upper levels of 95% confidence interval). The red points are the observed data. The black, blue and red vertical lines denote the estimates of times when the case ascertainment rate, transmission rate and confirmed case fatality rate respectively are estimated to have changed. In panel (b), the blue points represent daily number of imported cases which are nearly invisible. Note that the daily number of confirmed cases (15,152) on 12th February 2020 (day 74) is beyond the range shown in panel (b).

If the public had been made aware of COVID-19 earlier and the control measures started earlier, the size of outbreak could have been much smaller. This can be quantitatively analysed by assuming the same epidemiological characteristics but moving the start times at which the reporting, transmission and death rates changed. The results are listed in Table 3. It shows that if the control measures started 1 week, 2 weeks, or 3 weeks earlier, about 57% (57%), 81% (82%) and 93% (93%) of the confirmed cases (and deaths) would have been averted. However, if the control measures had started 1 week, 2 weeks, or 3 weeks later, the numbers of the confirmed cases (and deaths) would have increased 2.3-fold (2.3), 5.3-fold (5.4), and 12.4-fold (12.5) across mainland China, respectively.

Table 3 Impact of changing the start time of control measures during the outbreak in mainland China on the number of infections, confirmed cases and deaths from 1st December 2019 to 21st April 2020.

To investigate how the epidemic within the epicentre differs from that over the whole nation, we also obtain estimates of model parameters for Hubei province and Wuhan city (Supplementary Table S3.1). The results show that within the epicentre, the transmissibility and case ascertainment rate differ only slightly. However, there was a large difference in the estimated confirmed case fatality rate. Before the time point τF when the confirmed case-fatality was estimated to have changed (15 February 2020), the estimated cCFR is 9.6% (95% CI 8.1–11.4%) across the whole country, 11.5% (95% CI 10.4–13.5%) over Hubei province, and 15.1% (95% CI 13.3–17.6%) within Wuhan city. However, the estimate of cCFR appears to be nearly the same in all three settings after τF. These estimaes imply that the fatality rate before 15 February 2020 is significantly higher at the epicentre than the national average, and many deaths of COVID-19 patients were due to insufficient treatments at the epicentre.

Discussion

Understanding the transmissibility and severity of the novel coronavirus (SARS-CoV-2) is paramount; understanding how its rapid spread was brought under control in mainland China is of practical importance for other countries now facing ongoing outbreaks of COVID-19. In this study we have studied the transmissibility and control of SARS-CoV-2 in China by using a mathematical model to reconstruct the COVID-19 outbreak in mainland China from 1st December 2019 to 21st April 2020. Our analyses indicate that the SARS-CoV-2 had a basic reproduction number of 2.33 (95% CI 1.96–3.69) in the absence of intervention measures and therefore it is highly transmissible. The fatality rate among those that are symptomatic and confirmed was about 9.6% before 15 February 2020, and then reduced to 0.7%. The draconian control measures taken by the Chinese government from 23rd January 2020 brought the spread of SARS-CoV-2 under control in mainland China. However, it took more than 2 weeks for the effect of control measures to emerge.

Our study shows that the early reporting rate (3.5%) was very low, suggesting that about 96.5% of all infections were undetected prior to 19th January 2020 (c.f.7,10,11). This might be mainly due to the limited knowledge and unclear definition of the novel disease12. The reporting rate of 36.6% at the late stage reflects increased awareness of the virus and consequently escalating rate of medical help seeking behaviour for respiratory symptoms; however, this rate appears relatively low and suggests that many infections were associated with mild symptoms or were asymptomatic. Analyses by China CDC13 using data up to 11th February 2020 estimated that COVID-19 has been mild for 81% of infected people. The high proportion of mild symptomatic or asymptomatic infections was further confirmed by a well-investigated outbreak on the Diamond Princess cruise ship during February 2020: among 696 confirmed cases, 410 (58.9%; c.f.14) were asymptomatic15.

The actions taken by the Chinese government appear to have stopped the spread of COVID-19 in mainland China. This was achieved under very strict control measures, which dropped the transmissibility, as reflected by the effective reproduction number to 0.04, which is 1.8% of its initial value of R0 = 2.33 (c.f.3). This estimate of R0 at the early stage is consistent with most previous estimates3,6,7,8,10,11,16,17,18,19,20,21,22. We found that the infectious period before isolation for confirmed cases is about 2.3 days, which is comparable with the estimates23,24,25,26; though it appears to be shorter than that obtained by Tian et al.3: 5.19 days (95% CI 4.51–5.86 days). Our estimate of the infectious period for infections that were not detected and not quarantined (4.4 days ranging from 3.2 to 11.7 days) agrees with the range identified in previous studies27,28 of 1–14 days.

Although increasing the rate of detection and quarantine of symptomatic cases can help reduce the number of sources of infection (c.f.2), the main force of transmission is from the undetected cases which contributed to most of the transmissibility of SARS-CoV-2 during the early stage of the outbreak in mainland China (c.f.7). Hence it is likely that limiting the mobility of the general population made the biggest contribution to stopping the spread of SARS-CoV-2 virus in mainland China. This may further explain why the draconian control measures implemented in China since the 23rd January 2020 took about 16 days (i.e., delay from 23rd January to 8th February 2020) to influence transmission within community.

The impact of actions taken by the Chinese government is also reflected on the change of mortality associated with COVID-19. Before 15th February 2020, cCFR is estimated at 9.6% and then it reduced to about its 7% (i.e., 0.67%). This huge reduction in cCFR may be due to the significant improvement of medical treatment and availability of medical resources in China, especially in the epicentre Hubei province.

It is interesting to compare our estimate of the mean case fatality rate against that from other studies, which, after averaging the values estimated before and after the 15th February, is 5.14%. This is slightly less than that of Deng et al.29: 5.65% (95% CI 5.50–5.81%). Including death data up to 17th April 2020, Deng et al.29 used individual-level data for cases to obtain their estimate which is close to both the corresponding crude or naïve confirmed case fatality risk: 4632/82,758 = 5.60% and the approximator of deaths/(deaths + recoveries) = 4632/ (4632 + 78,112) = 5.60% as of 21st April 2020. Based on the data up to 11th February 2020, Verity et al.9 suggested that the overall CFR for the outbreak in China was 1.38% (95% CI 1.23–1.53%). Using data up to 29th February 2020, Wu et al.30 found the overall symptomatic case fatality rate (the probability of dying after developing symptoms) to be 1.4% (95% CI 0.9–2.1%). Russell et al.31  used several simplifying assumptions to obtain  an estimate of the CFR in China to be 1.2% (95% CI 0.3–2.7). Our estimate of CFR is higher than these, which largely results from the fact that these three studies did not include the 1290 deaths added to the official data on 17th April 2020 by the Chinese National Health Commission32.

Through model fitting to observed data in mainland China, we found that the model predicted that there had been twenty eight times more cases than were reported at the early stage of the outbreak; even at the late stage since 19th January 2020, the estimated number of infected people were more than twice those reported. Based on the epidemiological characteristics obtained from COVID-19 outbreak data, our analyses suggest that even if the control measures had started 1 week earlier, they would have averted 57% of confirmed cases and 57% of deaths. If they had started 3 weeks earlier, then 93% confirmed cases and 93% of deaths would have been averted. This estimate of the effect of start time of control measures is smaller than but comparable with what Lai et al.2 and Yang et al.5 found, although both studies estimated only the number of infections. These findings highlight the importance of an early response in controlling transmission in the population.

As in Kucharski et al.6, we assumed the latent period is equal to the incubation period. In view of evidence that there is pre-symptomatic transmission28,33,34, it is interesting to know whether this will alter the results of our model. For this, we modified the model system (Fig. 1) by dividing exposure stage E into two equal sub-classes E1 and E2 and assumed that people in E2 can transmit the virus with the same infectiousness as ill cases (SI Sect. 4). We obtain similar results (Supplementary Table S4.1): For example, the basic reproduction number before the control measures is 2.22 (95% CI 1.95–2.92) of which 98% was due to undetected cases, and the cCFR is 9.5% (95% CI 8.0–11.2%) before 15 February 2020 and then reduces to 0.7% (95% CI 0.4%, 1.1%). This indicates our conclusions are relatively unaffected by assumptions relating to pre-symptomatic transmission.

The non-pharmaceutical interventions introduced in mainland China appeared to have stopped the spread of the virus, but the risk of another potential outbreak remains. As estimated in our analyses, at most 223,111 individuals were infected up to 21st April 2020 during the outbreak (Table 3), which is less than 0.02% of the Chinese population. Even if all those recovered from infection have developed complete immunity35, this level of immunity in the Chinese population is considerably lower than the herd immunity threshold of 55% which is required for control of transmission36,37. This simply implies that once the strict quarantine measures currently in place in mainland China are relaxed, transmission of SARS-CoV-2 in China is very likely to rebound, especially given the extent of ongoing transmission across the rest of the world (c.f.38). Nevertheless, the non-pharmaceutical interventions appear to have halted the spread of SARS-CoV-2 virus in mainland China and bought time for vaccines and drugs to be developed and used later on.

It is of interest to compare the novel coronavirus SARS-CoV-2 with other two coronaviruses: SARS-CoV-1 and MERS-CoV which caused large outbreaks in human populations. SARS-CoV-1 has a R0 from 2 to 539 and case-fatality rate of 9.6% among probable cases in mainland China40. MERS-CoV in the 2015 outbreak in South Korea has been estimated to have a R0 from 2 to 741 and case fatality rate of 34.5% among laboratory-confirmed cases42. This suggests that that SARS-CoV-2 is nearly as transmissible as SARS-CoV-1 and MERS-CoV, but it is potentially less severe particularly when sufficient treatments are available. Furthermore, there was no evidence of a super-spreader event occurring in any of the Chinese health facilities serving COVID-19 patients, which is distinct from the 2003 outbreak of SARS-CoV and 2015 MERS-CoV outbreak in South Korea. Due to advanced modern transportation, people can move easily and quickly across the world, which allows SARS-CoV-2 to quickly spread to other countries. This is in a sharp contrast with both SARS-CoV-1 and MERS-CoV which have been controlled and confined within relatively limited areas of the world. This sharp difference may be attributed to another aspect of the coronaviruses: a large proportion of SARS-CoV-2 infections are associated with mild symptoms or are asymptomatic13,14,15, while both SARS-CoV-1 and MERS_CoV are highly symptomatic21,33. In the absence of high coverage with highly effective vaccines, this characteristic of the SARS-CoV-2 virus will make it challenging for humans to control and manage it.

Compared with previous studies2,3,5,6,21,22, our analyses have two key strengths: our investigations are based on three datasets (confirmed cases, deaths and recovered cases) and we model the outbreak over a long period (143 days) which should avoid any bias and confounding arising due to observations over a short period. This study also has several limitations. To model the transmission dynamics and disease reporting, our synthesis model has included several simplifications. To reflect the temporal change in the ascertainment rate, transmissibility and case fatality of COVID-19, two different values are assumed for each. These quantities are likely to change gradually during the outbreak, as is the case for public awareness and interventions3. For example, Tsang et al.12 found the ascertainment rate changed as the case definition for COVID-19 changed from being initially narrow to becoming gradually wider during the period from 15th January to 3rd March. The time-to-event intervals such as the delay from symptom onset to death may also change as epidemic grows9,24,26 and the time from symptom onset to hospitalization may reduce as more reliable medical resources become available22,24. Further, in this study we ignore the heterogeneity in both geography and age3,7,13. To provide more specific and practically useful information for control measures, it needs to look at variations between regions3,7 and in different age groups13. A further limitation is that we model the overall effectiveness of integrated intervention measures rather than the different types of control measures and therefore cannot provide insight into their relative impacts in stopping the spread of infection (c.f.2,3,16).

In conclusion, our finding that the main driver of transmission of SARS-CoV-2 at the early stage of the outbreak in mainland China came from the undetected infections provides helpful information for policy makers when designing optimal intervention strategies. In the absence of vaccination and effective drugs, early detection and isolation are essential for containing and controlling the spread of SARS-CoV-2 but reducing contact rates among people in the general population is likely to have the biggest impact.

Methods

Data

We extracted the following data relating to COVID-19 for mainland China from 3 datasets on the website of Chinese National Health Commission for the period from 1st December 2019 to 21st April 2020: the daily number of confirmed cases who were confirmed/admitted to hospital, the daily number of deaths and the number of patients who had recovered each day. Here the data are given by the symptom onset date during the period 1st December 2019–1st January 2020 from1 and by reporting date thereafter from the website of Chinese National Health Commission, due to difficulties in collecting the onset dates of those cases through the website. The reporting date is assumed to be the same as the date that cases were diagnosed. Before 12th February 2020, confirmed cases were defined as those who were positive for SARS-CoV-2 on PCR; from 12th February 2020 confirmed cases were defined (for the epicentre Hubei province) as those who were either clinically diagnosed or positive for SARS-CoV-2 on PCR12, and as those who were PCR positive for SARS-CoV-2 for the rest of China. This case definition resulted in the number of confirmed cases increasing from 2015 to 15,151 between the 11th and 12 February 2020.

The daily numbers of reported deaths were adjusted to account for changes to the numbers of deaths that were made by the Chinese government, as follows. Specifically, on 17th April 2020, Chinese National Health Commission revised its estimates of the total numbers of COVID-19-related deaths in Wuhan city that had occurred since the start of the outbreak from 2579 to 386932. It stated that these additional 1290 deaths probably occurred before 20th February 2020, given that, after this date, the number of hospitals that could treat COVID-19 patients increased from 2 to 48, and the HOUSHEN and LEISHEN hospitals, and FANGCANG shelters that have been constructed since then also provided an increased number of beds meeting the needs of COVID-19 patients with different symptoms. Furthermore, the system of data collection improved rapidly, and the number of missed cases/deaths decreased greatly during the course of the outbreak. As detailed information for these 1290 deaths is not available, for the sake of model fitting, these 1290 deaths were distributed in proportion to the number of deaths reported each day from the date of the first death (10th January 2020) to 20th February 2020. For example, the number of deaths reported on 18th January 2020 was 26, the number is now corrected to 26 + 26 × 1290/2236 = 26 + 15 = 41. Here 2236 is the cumulative number of deaths reported up to 20th February 2020 before the government had revised its estimates in April. In sensitivity analysis, we also explored the effect of assuming that these 1290 deaths added on 17th April 2020 were distributed in proportion to the daily number of deaths reported over the entire period from 10th January to 17th April 2020. Except for estimates of mortality rate,  the results shown in Table 2 (and described in the “Results” section) are very similar for both ways of distributing the 1290 deaths added to the official statistics on 17th April 2020.

Model

In this study we use a synthesis model43 (Fig. 1), which combines the hidden transmission dynamics of SARS-CoV-2 virus and the reporting system of COVID-19, to investigate the transmissibility and severity of COVID-19 and the efforts to contain and control the spread of SARS-CoV-2. We assume the transmission dynamics of SARS-CoV-2 virus are described by an SEIR compartmental model, with a few modifications, as shown in Fig. 1. That is, a susceptible person (S) can contract SARS-CoV-2 virus from infectious people and then enter the latent class (E); after an average latent period (L), the exposed person progresses to become infectious (I1). A fraction (θ1) of these people (with severe symptoms) will be detected and admitted to hospital (H) and will then be treated and isolated from the community, with a fraction (1 − θ2) recovering after an average period DR and the other fraction dying after an average period DD. The other fraction (1 − θ1) of infected people, typically with mild or asymptomatic infections (Iu) will not be detected and hence remain in the community as sources of infection for a further average period Du before recovering. For simplicity, we assume that the average latent period equals the incubation period, which is fixed at L = 5.2 days as estimated by18,26,44. SARS-CoV-2 virus can be transmitted by three possible modes: respiratory transmission (through respiratory droplets when symptomatic people sneeze or cough), aerosol transmission (through fine virus particles that were aerosolized) and contact transmission (through contacting the contaminated surface). For simplicity and consistency with these easy and quick modes of transmission, we assume that people mix randomly. Although China is a large country with a population of 1,400,050,000 people residing on a huge area of 9,596,960 km2, its recent urbanization and development of rapid transport systems make it easy and quick for people to move around the country. This makes it reasonable to model the transmission of SARS-CoV-2 within the whole country as a well mixing population. For comparison, we also model the spread of COVID-19 within Hubei province and Wuhan city where the outbreak started (SI Sect. 3).

The synthesis model is described by Eq. (1). The 9 compartments are defined in Table 4 and the definitions of model parameters are given in Table 2.

$$\begin{aligned} & \frac{d}{dt}S\left( t \right) = - \beta \left( t \right)S\left( t \right)\left( {I_{1} \left( t \right)\left( { \theta_{1} + \xi \left( {1 - \theta_{1} } \right)} \right) + \xi I_{u} \left( t \right)} \right)/N \\ & \frac{d}{dt}E\left( t \right) = \beta \left( t \right)S\left( t \right)\left( {I_{1} \left( t \right)\left( { \theta_{1} + \xi \left( {1 - \theta_{1} } \right)} \right) + \xi I_{u} \left( t \right)} \right)/N - E\left( t \right)/L \\ & \frac{d}{dt}I_{1} \left( t \right) = E\left( t \right)/L - I_{1} \left( t \right)/D_{1} + Imported\left( t \right) \\ & \frac{d}{dt}I_{u} \left( t \right) = \left( {1 - \theta_{1} } \right)I_{1} \left( t \right)/D_{1} - I_{u} \left( t \right)/D_{u} \\ & \frac{d}{dt}H\left( t \right) = \theta_{1} I_{1} \left( t \right)/D_{1} - 2\theta_{2} H\left( t \right)/D_{{{\text{D}}}} - 2(1 - \theta_{2} )H\left( t \right)/D_{{{\text{R}}}} \\ & \frac{d}{dt}Recovered\_0\left( t \right) = 2(1 - \theta_{2} )H\left( t \right)/D_{{{\text{R}}}} - 2Recovered\_0\left( t \right)/D_{{{\text{R}}}} \\ & \frac{d}{dt}Recovered\left( t \right) = 2Recovered\_0\left( t \right)/D_{{{\text{R}}}} \\ & \frac{d}{dt}Dead\_0\left( t \right) = 2\theta_{2} H\left( t \right)/D_{{{\text{D}}}} - 2Dead\_0/D_{{{\text{D}}}} \\ & \frac{d}{dt}Dead\left( t \right) = 2Dead\_0\left( t \right)/D_{{{\text{D}}}} \\ \end{aligned} $$
(1)
Table 4 Definitions of model compartments.

Here N = 1,400,050,000 is the total population size in mainland China and is assumed to be constant during the outbreak. The model also includes imported cases (i.e., Imported(t) in equation for I1) from outside China as reported by Chinese National Health commission since 24th March 2020, which is likely to help with improving the reliability of our estimates of the transmissibility at the late stage of the epidemic and the effectiveness of intervention measures. Including the numbers of imported cases in the model means that we implicitly include people entering China at the latent stage. For example, due to the stronger and improved testing and tracing services at the late stage of the epidemic, they would have been put into isolation for 2 weeks after entering China and would have been detected as cases during that time.

Consistent with our assumption that people mix randomly, the force of infection (the rate at which susceptibles are infected per unit time) is assumed to be proportional to the number of infectious people at time t, which is given by the following expression in the equation for the rate of change in the number of susceptible individuals (i.e. dS(t)/dt):

$$\left( {I_{1} \left( t \right)\left( {\theta_{1} + \xi \left( {1 -\theta_{1} } \right)} \right) + \xi I_{u} \left( t \right)} \right)$$

This expression comprises three terms: I1(t)θ1—the number of infectious people at time t who have severe symptoms and will be detected and hospitalised, I1(t)(1 − θ1)—the number of infectious people who have mild or no symptoms and will not be detected and therefore will transfer to the compartment Iu, and Iu(t)—the number of infectious people who have not been detected and will remain so. The parameter ξ is the relative infectiousness of people with undetected infections compared to confirmed cases.

To model the potential changes in the transmission rate (the rate at which two specific people come into effective contact per unit time) due to the combined and dramatic control measures taken by Chinese government, we simply assume that the transmission coefficient, β, differed before and after they are introduced as shown in the Eq. (2)

$$\beta \left( t \right) = \left\{ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {\beta_{a} } & {t \le {\tau_{\beta} } } \\ \end{array} } \\ {\begin{array}{*{20}c} {\beta_{b} } & {t >{\tau_{\beta }} } \\ \end{array} } \\ \end{array} } \right.$$
(2)

(c.f.3,41). Here τβ is the time point when the transmission rate changed as a result of the interventions and will be estimated in our analyses.

In addition, the limited knowledge and unclear definition of COVID-19 in the early stages of the outbreak means that the proportion of cases that were detected and reported was probably low and increased as knowledge about COVID-19 improved and the availability of advanced techniques to test SARS-CoV-2 increased12. In particular, case-finding, diagnosis and reporting have sped up since 20th January 2020. Local governments across China encouraged and supported routine screening and quarantine of travellers from Hubei Province in order to identify COVID-19 infections as early as possible2. To reflect these changes, we assume that the proportion of people that were detected and reported (θ1) (or rate of case ascertainment) varied with time as follows

$$\theta_{1} = \left\{ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {\theta_{1,a} } & {t \le \tau_{\theta } } \\ \end{array} } \\ {\begin{array}{*{20}c} {\theta_{1,b} } & {t > \tau_{\theta } } \\ \end{array} } \\ \end{array} } \right.$$
(3)

Here τθ is the time point when the rate of case ascertainment changed.

It is also likely that the case fatality rate changed over time, as treatment improved with increasing provision of medical resources. For simplicity, we introduce a time point τF so that before that time, the confirmed case fatality rate is θ2,a and after this time point it becomes θ2,b, as follows:

$$\theta_{2} = \left\{ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {\theta_{2,a} } & {t \le_{F} } \\ \end{array} } \\ {\begin{array}{*{20}c} {\theta_{2,b} } & {t > \tau_{F} } \\ \end{array} } \\ \end{array} } \right.$$
(4)

θ2,a and θ2,b and τF will be estimated in our analyses.

Although there may have been some sporadic cases of COVID-19 that might not have had the chance to be hospitalized and so would die at home, especially at the early stage of the outbreak, our assumed flow of confirmed COVID-19 patients should approximate actual procedures during the outbreak in mainland China reasonably well45. In this study we fix the average durations from onset of symptoms to death and to hospital discharge at (D1 + DD =) 17.8 days and (D1 + DR =) 22.6 days, respectively, consistent with the estimates9 from the China outbreak data. To allow the durations from hospitalization to death and from hospitalization to recovery to follow the Gamma distribution rather than the usual exponential distribution46, we introduce the intermediate compartments Dead_0, and Recovered_0. As we use (mostly) data of reported dates, the time-event-length (duration) should be thought to include such reporting delay, which is not explicitly treated in this study6.

The basic reproduction number R0, which is defined as the average number of secondary infectious people generated by an infectious person introduced into a completely susceptible population, is an important quantity which characterises the transmissibility of infectious agents36. We can obtain the expression for R0 by considering the situation without imported cases and obtain the steady-state solution of the equation system (1), with its S* (the size of the population susceptible to infection at equilibrium) being given by

\(S^{*} = N\frac{{1/D_{1} }}{{\beta_{{\text{a}}} [\theta_{{1,{\text{a}}}} + \xi \left( {1 - \theta_{{1,{\text{a}}}} } \right) + \xi \left({1-\theta_{1,{\text{a}}}}\right) D_{{\text{u}}} /D_{1} ]}}\).

At the steady state, R0 × (S*/N) = 136,37, the above expression gives the basic reproduction number at the early stage as follows

$$R_{{0,{1}}} = \, \left( {\theta_{{{1},{\text{a}}}} D_{{1}} + \xi ({1} - \theta_{{{1},{\text{a}}}} )\left( {D_{{1}} + D_{{\text{u}}} } \right)} \right)\beta_{{\text{a}}}$$
(5a)

Similarly, at the late stage after the time point defined as max (τβ, τθ), the effective reproduction number under interventions is

$$R_{{0,{2}}} = \, \left( {\theta_{{{1},{\text{b}}}} D_{{1}} + \xi ({1} - \theta_{{{1},{\text{b}}}} )\left( {D_{{1}} + D_{{\text{u}}} } \right)} \right)\beta_{{\text{b}}}$$
(5b)

From Eq. (5a), the contribution from undetected infections to the transmissibility at the early stage is

$$R_{{0,{1}}} \left( {\text{undetected cases}} \right) \, = \xi ({1} - \theta_{{{1},{\text{a}}}} ) \, \left( {D_{{1}} + D_{{\text{u}}} } \right)\beta_{{\text{a}}}$$
(5c)

Here D1 + Du is the average infectious period of the infected people who were not detected. The parameter ξ is introduced to measure the relative infectiousness of undetected infections to confirmed cases, and it will take the value of 1.0 (i.e., both undetected and confirmed infection are of the same infectiousness). The above equations for the basic reproduction number can also be obtained using the Next Generation Matrix approach described in Diekmann et al.47 (see SI Sect. 5).

To see how the relative infectiousness affects our findings, estimates of R0 and other model parameters for ξ = 1/2, 1/3 are given in Supplementary Table S2.1.

Initial seeding

In this study we model the transmission process from 1st December 2019 with initial infections of I1(0) = I0 which is to be estimated from model fitting to data (see below). The initial values of the other variables were obtained from I0 and the total population size by assuming that the epidemic around 1st December 2019 was at its early exponential growth stage, as described in SI Sect. 1. In contrast, Wu et al.21 assumed that the epidemic during 1st–31st December 2019 was seeded by a constant zoonotic force of infection that caused 86 cases (twice the number of 43 confirmed cases with zoonotic exposure) per day before market closure on 1st January 2020. Kucharski et al.6 assumed the outbreak started with a single infectious case or 10 cases on 22nd November 2019.

Inference method

We denote the set of model parameters to be inferred as Θ = {I0, βa, βb, τβ, θ1,a, θ1,b, τθ, θ2,a, θ2,b, τF, D1, Du} which is listed in Table 2. For each set of parameter values, the Runge–Kutta fourth order method was used to solve the model equations and to obtain model predicted time series of infections, confirmed cases, deaths, and recovered cases. In the inference of model parameters, directly observed dataset of confirmed/hospitalized/reported cases (denoting it as HOS for short in the following), Death and Recovery are used as illustrated in the following. To capture the large dispersion in the daily numbers of these observations, the negative binomial likelihood function was assumed. The likelihood for number xC (t) of observations on day t is given as

$$l(x^{C} \left( t \right)|{\Theta },\eta^{C} ) = \frac{{{\Gamma }\left( {x^{C} \left( t \right) + r^{C} \left( t \right)} \right)}}{{{\Gamma }\left( {r^{C} \left( t \right)} \right){\Gamma }\left( {x^{C} \left( t \right) + 1} \right)}}\left( {\frac{1}{{\eta^{C} }}} \right)^{{r^{C} \left( t \right)}} \left( {1 - \frac{1}{{\eta^{C} }}} \right)^{{x^{C} \left( t \right)}}$$
(6a)

where

$$r^{C} \left( t \right) = \frac{{\mu^{C} \left( t \right)}}{{\eta^{C} - 1}}$$
(6a)

Here \(\eta^{C}\) is the dispersion parameter, which is estimated and μC(t) are the predictions of the number of hospitalisations, deaths or newly recovered people, as appropriate on day t from synthesis model (1). Superscript C represents three different datasets: HOS, Death and Recovery.

Special attention was paid to the extra high daily number of cases (15,152) on 12th February 2020 (day 74 from 1st December 2019) due to the change in the case definition in Hubei province12,48. In principle, most of these cases might have been accumulated over the days before 12th February 2020. To deal with this complexity, only the cumulative numbers of cases on the 12th February and daily numbers of cases after that will be used in the model inference. Let the reported daily number of HOS be represented by x(1), x(2), …, x(T), with T being the number of days from 1st December 2019 to 21st April 2020. The likelihood of the cumulative number of cases on 12th February \(X = \sum\nolimits_{t = 1}^{74} x \left( t \right)\) is assumed to be

$$L_{74} = \frac{{{\text{exp}}\left( { - M} \right)}}{X!}{\text{\rm M}}^{X}$$
(7)

Here \({\text{M}} = \sum\nolimits_{t = 1}^{74} {\mu^{{{\text{HOS}}}} } \left( t \right)\) represents the cumulative number of confirmed cases predicted by model. Assuming that the observed daily number of deaths: y(1), y(2),…,y(T) and daily number of recovered people: z(1), z(2),…,z(T) are conditionally independent, the total likelihood given model parameters Θ is

$$L\left( {{\Theta },\eta^{{{\text{HOS}}}} ,\eta^{{{\text{Death}}}} ,\eta^{{{\text{Recovery}}}} } \right) = L_{74} \times \mathop \prod \limits_{t = 75}^{T} l\left( {x\left( t \right){ {|\Theta }},\eta^{{{\text{HOS}}}} } \right) \times \mathop \prod \limits_{t = 1}^{T} l\left( {y\left( t \right){ {|\Theta }},\eta^{{{\text{Death}}}} } \right)l\left( {z\left( t \right){ {|\Theta }},\eta^{{{\text{Recovery}}}} } \right)$$
(8)

Parameter inference: We assume the uninformative prior distributions f(Θ) which are uniform for parameters (Table 2). Employing a Bayesian framework through the combination of the prior distribution f(Θ) and the likelihood L(Θ,ηHOS, ηDeath, ηRecovery; x, y, z), the posterior distribution can be obtained by Markov Chain Monte Carlo simulations (MCMC)49. From these samples, we obtain the medians and the 95% confidence intervals for the model parameters. The posteriors of the model parameters will provide the estimates of the transmissibility and the severity of SARS-CoV-2 in mainland China and the effects on transmissibility of control measures implemented by the Chinese government. The same inference method will also be used when modelling the outbreak in epicentre Hubei province and Wuhan city.

To explore the effect of the timing of control measures, we calculate the numbers of infections, cases and deaths that would have occurred if the control measures had been implemented 1, 2, or 3 weeks earlier than they had been, or if they had been implemented 1, 2, or 3 weeks later than they had been. To do this, we simply change the start times at which the reporting, transmission and death rates changed while the transmission parameters, ascertainment rate and fatality rate remain unchanged.