Modeling and forecasting the early evolution of the Covid-19 pandemic in Brazil

We model and forecast the early evolution of the COVID-19 pandemic in Brazil using Brazilian recent data from February 25, 2020 to March 30, 2020. This early period accounts for unawareness of the epidemiological characteristics of the disease in a new territory, sub-notification of the real numbers of infected people and the timely introduction of social distancing policies to flatten the spread of the disease. We use two variations of the SIR model and we include a parameter that comprises the effects of social distancing measures. Short and long term forecasts show that the social distancing policy imposed by the government is able to flatten the pattern of infection of the COVID-19. However, our results also show that if this policy does not last enough time, it is only able to shift the peak of infection into the future keeping the value of the peak in almost the same value. Furthermore, our long term simulations forecast the optimal date to end the policy. Finally, we show that the proportion of asymptomatic individuals affects the amplitude of the peak of symptomatic infected, suggesting that it is important to test the population.


Scientific Reports
| (2020) 10:19457 | https://doi.org/10.1038/s41598-020-76257-1 www.nature.com/scientificreports/ Let β be the average number of contacts that are sufficient for transmission of a person per unit of time t. Then βI/N is the average number of contacts that are sufficient for transmission with infective individuals per unit of time of one susceptible and (βI/N)S is the number of new cases per unit of time due to the S susceptible individuals. Furthermore, let γ be the recovery rate, which is the rate that infected individuals recover or die, leaving the infected class, at constant per capita probability per unit of time.
Based on these definitions, we can write the SIR model as It is worth mentioning that we can also evaluate the number of recovered individuals from Eq. (1) using also the number of susceptible and infected individuals, since in this version of the SIR model (Eq. 2) the population is constant. Actually, since we are modeling a short term pandemic, we do not consider the demographic effects and we assume that an individual does not contract the disease twice. We do not implement this model, we only included it for the sake of reference.
We actually want to estimate the fraction of people that die from the disease. Then we include a probability ρ of an individual in the class I dying from infection before recovering 4 . In this case, we get the following set of equations where ρ 1−ρ γ I is the number of people in the population that die due to the disease per unity of time and D is the number of people that die due to the disease. Note that in this case the number of individuals in the population reduces due to the infection according to dN dt = − ρ 1−ρ γ I . For the ease of reference, we call this model "SIRD" (Susceptible-Infected-Recovered-Dead) model.
Since, in the case of the COVID-19, there is a relevant percentage of the infected individuals that are asymptomatic, we split the class of infected individuals in symptomatic and asymptomatic 7-9 : where I A is the number of asymptomatic individuals, I S is the number of symptomatic individuals, R A and R S are the recovered individuals from the asymptomatic and symptomatic infection, respectively, and p is the proportion of individuals who develop symptoms. For ease of reference, we call this model "SIRASD" (Susceptible-Infected-Recovered for Asymptomatic-Symptomatic and Dead) model. Like the SIRD model, the condition that N is constant does not hold anymore and if we need to evaluate N over time, we need to integrate dN dt = − ρ 1−ρ γ S I S . In order to consider the effect of the social distancing policy, we modify the transmission factors of Eqs. (3) and (4) by multiplying them by a parameter ψ ∈ [0, 1] , when the date belongs to the period of the implementation of government policy. Otherwise, we use ψ = 1 . To be precise, we replace β in Eq. (3) by ψβ , β A in Eq. (4) by ψβ A and β S in Eq. (4) by ψβ S . Note that doing this procedure we avoid the introduction and estimations of new " β s" and we may use ψ to evaluate the effectiveness of social distancing policy. In the end, we may measure the social distance as 1 − ψ.
Our models provide estimates of the epidemiological parameters, that are consistent with the international literature, and good forecasts of the short-term Brazilian time series of infected individuals in Brazil. Furthermore, one of our models assesses the number of asymptomatic (or individuals with mild symptoms that do not (3) www.nature.com/scientificreports/ look for the hospitals and are not being tested). We use these models to simulate long-term scenarios of the pandemics that depend on the level of engagement of the Brazilian social distancing policy. We show that: (1) the social distancing policy imposed by the government is able to flatten the pattern of contamination provided by the COVID-19; (2) there is an optimal date for abandoning the social distancing policy; (3) short-term social distancing policies only shift the peak of infection into the future keeping the value of the peak in almost the same value. (4) The proportion of asymptomatic individuals affects the amplitude of the peak of symptomatic infected, meaning that it is important to invest in testing the population, massively or by random sampling. Our work relates to the recent interesting contributions [10][11][12][13] in the sense that all these works try to model the spread of the COVID-19 and to evaluate the countermeasures against this virus. However, our paper differs from these works in the following dimensions: (1) data: our work focuses in Brazilian data. This is an important characteristic since different countries may present different demographies and we know that the COVID-19 is riskier for older populations that appear with higher proportion in developed countries. Furthermore, the level of nutrition of the population of the country may affect the probability of contracting and developing the disease. The quality of data may vary from developed countries to underdeveloped ones and, in our paper, we do not use data from other countries to calibrate our models. (2) Model: we use variations of the SIR model mentioned above. One of the advantages of the SIR model is the simplicity and researchers have used this model in several successful attempts to model the spread of infectious diseases [14][15][16][17] . (3) Estimation: our paper estimates all the parameters based on a clear hierarchical procedure based on squared error minimization.

Results
Data analysis. We use the real data provided by the Ministry of Health of Brazil from February 25, 2020 to March 30, 2020 in our estimations. If we change the final date of the period of estimation of the epidemiological parameters of the model, we note that there is a structural change in the data suggesting the effectiveness of the social distancing policy. It is worth mentioning that it is hard to know exactly when social distance measures took effect mostly because there is a variable incubation period of the virus given by a range from 2 to 10 days 18 and some initiatives of social distance measures (such as home office) started even before the official implementation of the social distancing policy. In fact, after March 23, 2020, we are able to see in the data three consecutive reductions in the first difference of the cumulative number of infections, so depending on the final date that is used for the estimation of the SIRD model, the estimated parameters cannot fit the real data anymore, as shown in Fig. 1. Thus, we define two estimation periods: (1) February 25, 2020-March 22, 2020, in which we estimate the epidemiological parameters of Eqs. (3) and (4); (2) March 23, 2020-March 30, 2020, in which we estimate the paramter ψ.
Regarding the estimation of the epidemiological parameters of Eqs. (3) and (4), we estimate all parameters of our model by minimizing the squared error of integrated variables and their real values 5,28 . We proceed in a hierarchical procedure. We start by estimating the parameters of the SIRD model, namely β , γ and ρ by minimizing the squared error where I cum t and D t are the cumulative number of infected individuals and deaths, which are the real data provided by the Ministry of Health of Brazil, and Î t , R t and D t are estimated values of the infected, recovered and deaths, respectively. We use the nonlinear function f (z) = C 2 log (g(z)/C) 2 to correct the exponential characteristic of the series so that the errors of the last values of the series do not dominate the minimization, where www.nature.com/scientificreports/ g(z) = log(1 + z) . Furthermore, we use the scaling parameter C = 2 to soft threshold between inliers and outliers. Using this procedure, we note that the estimated epidemiological parameters vary less among simulations with different random seeds.
After estimating the SIRD model, we proceed by estimating the SIRASD model. Note that we lack information on the number of asymptomatic individuals, since the clear recommendation of the Ministry of Health is to test for the virus only if one has moderate or severe symptoms. Otherwise, follow the "stay at home" policy, which recommends individuals with mild symptoms to stay at home and do not seek for medical attention. Furthermore, the mortality rate is evaluated mostly over the symptomatic ones, since the asymptomatic are in many cases not tested. Therefore, we suppose that β S = β , γ S = γ and we keep the value of ρ . Using these parameters, and assuming that there is only one asymptomatic individual in the beginning of the simulation, we estimate the parameters β A , γ A and p in order to minimize the squared error where I cum t and D t are real data provided by the Ministry of Health of Brazil, the cumulative number of infected individuals and deaths, and Î S,t , R S,t and D t are estimated values of the symptomatic infected and recovered individuals, and deaths. Table 1 presents the epidemiological parameters of our model and some reference values. We also show other values for these epidemiological parameters obtained from other simulations in Table 5 in "Methods". Some of the lines of Table 1 deserve remarks. First, the basic reproductive number R 0 in both models are comparable to the values for China and Italy. Second, the death rate ρ is very close to the values disclosed by the Brazilian Ministry of Health and the average of international values. We point out that our estimation of the death rate uses data that presumes there are places in hospitals to treat patients with severe infections, that is the situation that is present in the data now. Depending on the government policy, we do not know whether this is true or not at the peak of infection. Third, the proportion of symptomatic individuals p is smaller than the international reference due to the Brazilian Ministry of Health policy "only test if you have strong symptoms". In fact, the same problem of underdiagnosis also seems to have happened in the early epidemics in China 29 .
In the last step of the estimation procedure, in order to estimate the parameter ψ , we keep all model parameters as previously estimated and we also minimize the mean squared error using loss functions similar to the ones defined in Eqs. (5) and (6), depending on the case, in the period after March 23, 2020. Furthermore, in order to evaluate the effectiveness of the social distancing policy, we estimate a new value of ψ for each new point of the time series as shown in Table 2, where the column 2 shows the estimations of ψ for the SIRD model and column 4 shows estimations of the same parameter for the SIRASD model. Although there is a small gap between the values of ψ for different models (SIRD or SIRASD), both columns suggest that the social distance factor ψ is going down, meaning that more people are joining the government policy. According to the models, the transmission rate is reduced to approximately 62% of its original value. Table 2 also presents the effective reproductive number derived from the impact of ψ on the transmission factors. Figures 2 and 3 present respectively the short-term forcasts of the SIRD and the SIRASD models, where the models incorporate the ψ factor in order to rescale the transmission factors ( β , β A and β S ) in the scenario with the social distancing policy imposed by the government. Note that Fig. 3 explicitly shows the propor-

Forecasts.
(2) Some parameters have not presented relevant variation in the significance level of this study. In these cases, the 90% interval includes only the value of the parameter.    www.nature.com/scientificreports/ tion of unknown asymptomatic individuals that when added to the symptomatic individuals skew the total value of infected individuals upwards. We also use the SIRD and SIRASD models to provide long term forecasts of the evolution of the COVID-19 pandemic in Brazil depending on the social distancing policy considered. While Fig. 4 shows the forecasts for the SIRD model, Fig. 5 shows the forecasts for the SIRASD model. In particular, we may note that while the SIRASD model predicts that the number of infected is higher than the estimates of the SIRD model, it also predicts a lower peak for the infected with symptoms, which are the ones that could require medical attention.
We explore four cenarios: (I) no measures of social distancing policy (black line); (II) current social distancing policy imposed by the government for an indefinite time (blue line); (III) 2-month social distancing policy imposed by the government (yellow line); and (IV) optimum limited time social distancing policy imposed by the government, so that the second infection peak is not greater than cenario II (red line). Scenario III suggests that policies based on short-term social distancing policy are not enough to constrain the evolution of the pandemic, that is, if social distancing policy measurements are released before the optimal time, a second peak should be experienced. The peaks and dates in which they occur are detailed in Table 3. In the case of Scenario IV, the last day of the social distancing policy is June 22, 2020 for the SIRD model and June 16, 2020 for the SIRASD model.
In addition to Fig. 5, we also present the evolution of the proportion of asymptomatic and symptomatic in Fig. 6. In this figure, we show the instant proportion [ I S,t /(I A,t + I S,t ) for the symptomatic and I A,t /(I A,t + I S,t ) for the asymptomatic] and the cumulative proportion as well. Note that the proportion of individuals who develop symptoms, p in Eq. (4), alters the transmission rate, so it also affects the evolution of the number of asymptomatic and symptomatic individuals over time. So this plot estimates the evolution of this proportion. The last column of the last line of Table 1 shows that the proportion of asymptomatic may vary from 29 to 37%, but this value is not fixed and evolves over time 26 . Our estimates suggest that the proportion of cumulative asymptomatic is approximately 68% in March 30, 2020, which converged to 1 − p (with p given in Table 1); that may account for some individuals with mild symptoms that were not tested.  www.nature.com/scientificreports/ Finally, it is worth considering that the SIRASD differential equations, presented in Eq. (4), need an initial condition for the number of asymptomatic individuals. If we find the parameters values (β A , γ A , p) by solving the optimization problem of Eq. (6) using different conditions, we get different results, that is, different peak values for the symptomatic individuals. If the proportion of asymptomatic individuals is larger, then this may be good news since it may represent less pressure for the health care system. But since we do not have enough tests to map the whole population, we need to work with hypotheses. Figure 7 shows the effect of different initial conditions in the symptomatic percentage and the peak value of symptomatic, that is, we vary the initial conditions, evaluate   www.nature.com/scientificreports/ the symptomatic proportion (parameter p in the SIRASD model), then calculate the peak value of symptomatic infected. So if we assume that the number of asymptomatic (symptomatic) individuals in data is larger (smaller) today, the number of asymptomatic (symptomatic) individuals will also be larger (smaller) in the time of the peak, leading to a smaller peak for the symptomatic.

Discussion
We use the Brazilian recent data from February 25, 2020 to March 30, 2020 to model and forecast the evolution of the COVID-19 pandemic in Brazil. We estimate two variations of the SIR model using historical data and we find parameters that are in accordance with the international literature. We also introduce a factor ψ to account for the effect of the government social distancing measures. Our methodology is able to estimate the asymptomatic individuals, that may not be entirely present in data. Since the Brazilian government does not have enough tests for mass testing, this measure may provide some additional information. In fact, we show the relevance of the number of asymptomatic individuals, since the larger the number of asymptomatic individuals, the smaller the number hospital beds needed. The "stay at home" and "only test if you have strong symptoms" policies present contradictory effects in the disease control. While they avoid an increase in the number of infected people and the use of extra resources with people that present only mild symptoms, they reduce the amount of information about the real number of infected individuals. In particular, it explains the low value of the parameter that measures the proportion of individuals who present symptoms, since we count many individuals with mild symptoms as asymptomatic.
While our short-term forecasts are in great accordance with the data, our long-term forecasts may help us to discuss different types of social distancing policies. We also show that the social distancing policy imposed by the government is able to flatten the pattern of contamination provided by the COVID-19, but short-term policies are only able to shift the peak of infection into the future keeping the value of the peak in almost the same value. Furthermore, we define the idea of the optimal social distancing policy as the finite social distancing policy that the second peak that happens after stopping the policy is not larger than the first. Based on this definition, we provide an estimate of the optimal date to end the social distancing policy.
An important discussion is about the effectiveness of vertical containment policies, where only people at risk follow social distance policies. In these kinds of policies, the two fractions of the population, the one at risk and the other one, present very different behaviors. First, the dynamics of the population at risk behaves similarly to the case with social distancing measures, but with a higher death rate. Second, the dynamics of the population that is not at risk behaves similarly to the case without social distance measures but with a low death rate. Third, since the fraction of the population that is at risk is much smaller than the rest of the population, the number of infected of the total population behaves similarly to the case without control. In fact, the policy's effectiveness is not in reducing the number of infected, but in reducing the number of deaths by confining individuals at risk. It is worth mentioning that the effectiveness of these vertical containment polices depends strongly on the ability to separate the individuals at high risk from the individuals at low risk and on the number of vacancies in hospitals to treat the disease. We may extend our model to explore these type of scenarios and we leave for future work.
Finally, another interesting research path is to evaluate the economic side effects of pandemic control 30,31 and to propose measures to minimize these impacts 32 .

Methods
The solution of the systems of differential equations. We find the numerical solutions of Eqs. (3) and (4) through integration using the explicit Runge-Kutta method of order 5(4) 33 . While this method controls the error assuming accuracy of the fourth-order, it uses a fifth-order accurate formula to take the steps. We use the implementation "solve_ivp" of the scipy Python's library.
The solution of the systems of differential equations depends on the definition of initial conditions. We use N 0 = 210147125 , that is the Brazilian population according to Brazilian Institute of Geography and Statistics (IBGE) which is the agency responsible for official collection of statistical, geographic, cartographic, geodetic and environmental information in Brazil, for both models. For the case of the SIRD model, we use S 0 = N 0 − 1 and I 0 = 1 . For the case, SIRASD model, we use I S0 = 1 and S0 = N 0 − I A0 − I S0 . We use I A0 = 1 in all simulations of the paper but the simulations presented in Fig. 7, since we want to learn about the effect of I A0 in the proportion of symptomatic and asymptomatic individuals in the peak date.
The estimation procedure. Our estimation procedure requires simultaneous integration of the differential equations (SIRD or SIRASD model depending on the case) and minimization of the loss functions [(5) or (6)] depending on the case for each time t. We minimize the loss functions using the method "optimize. least_squares" also from the scipy Python's library 34 using the cauchy loss with scaling parameter C = 2 35,36 . To minimize the impact of the initial point assumption and data incompleteness, we repeat the estimation procedure 100 times using random initial conditions, but we discarded estimations which did not converge. Since this is a difficult nonlinear problem we bound the parameters estimation region. In particular, we use the bounds presented in Table 4. To be clear, the fact that β S = β and γ S = γ is a consequence of our hierarchical estimation procedure previously described in "Results" section. Furthermore, β A ∈ [0, β S ] means that β A ≤ β S 7 , since the asymptomatic individuals do not have symptoms that may help the spread of the infection.
Finally it is worth mentioning that this estimation procedure is sensitive to the random seed used by the algorithm as an initial condition. In particular, depending on this seed, we have found different epidemiological parameters in different simulations of the SIRD model, as presented in Table 5. We have chosen the simulation results that provided the closest value of the median of the parameter γ , which is the one that used the random seed 7. We emphasize that although any of the presented epidemiological parameters could be a possible