Abstract
The global spread of COVID-19, the disease caused by the novel coronavirus SARS-CoV-2, has posed a significant threat to mankind. As the COVID-19 situation continues to evolve, predicting localized disease severity is crucial for advanced resource allocation. This paper proposes a method named COURAGE (COUnty aggRegation mixup AuGmEntation) to generate a short-term prediction of 2-week-ahead COVID-19 related deaths for each county in the United States, leveraging modern deep learning techniques. Specifically, our method adopts a self-attention model from Natural Language Processing, known as the transformer model, to capture both short-term and long-term dependencies within the time series while enjoying computational efficiency. Our model solely utilizes publicly available information on COVID-19 related confirmed cases, deaths, community mobility trends and demographic information, and can produce state-level predictions as an aggregation of the corresponding county-level predictions. Our numerical experiments demonstrate that our model achieves state-of-the-art performance among the publicly available benchmark models.
Introduction
COVID-19 has been spreading globally and has affected almost every country since 2020. In the United States (US), the COVID-19 pandemic started spreading in January 2020, and in March, the daily numbers of confirmed cases and deaths rose to an alarming stage^{1}. To control the rapid spread of COVID-19, policymakers in many states imposed movement restrictions and partial confinement on everyone. For example, companies, educational institutes and public places were forced to close or operate remotely. This has significantly affected the nationwide economy^{2} and posed challenges to our daily life. More importantly, the severity of COVID-19 resulted in the loss of lives and might have caused serious long-term complications^{3}. With nearly 31 million cases and 0.56 million deaths^{4} in the US alone, COVID-19 has become a serious threat to mankind.
Since the beginning of the pandemic, different parties^{5,6,7,8} have made concerted efforts to collect and publish COVID-19 related data, including confirmed cases, deaths, hospitalization information, demographics, and community mobility. Based on the publicly available data, researchers have built predictive models to study the disease dynamics. These efforts include compartmental models such as variants of Susceptible-Infectious-Recovered (SIR) models^{9,10} and statistical models using regression^{11} or time series analysis^{12}. Besides, researchers have also applied agent-based simulation modeling^{13,14} and deep learning models^{15,16,17} for predicting COVID-19 dynamics. Moreover, the Centers for Disease Control and Prevention (CDC) has been leading a collaborative effort to produce an ensemble model from different research groups^{18} (see more detailed discussions of the related work in COVID-19 dynamics prediction in a later section).
Two key measurements used by various research groups in the study of the spread of COVID-19 are the number of confirmed cases and the number of deaths. Both serve as indicators of the disease dynamics of COVID-19. It should be noted that both measurements are subject to similar biases that may affect the accuracy of the data, primarily due to reporting criteria^{7} or administrative delay^{19}. However, the number of confirmed cases is often subject to additional measurement error due to the number of people tested and the testing procedure. For example, some datasets^{7} only present confirmed cases as the number of people having positive results for a completed polymerase chain reaction test. Such coarse-grained testing numbers may undercount the true spread of COVID-19 while introducing a significant bias to predictive models. Despite the caveat, the number of deaths is still a relatively better indicator of the intensity of COVID-19 for policymakers to make decisions. As such, we focus on predicting COVID-19 related deaths in this study.
Furthermore, because disease spread severity differs across geographical areas, policymakers need to tailor policies to the particular local situation of the state or the county. Hence, accurate prediction at the state and county level is crucial for informed decisions. Indeed, county-level analysis can provide insightful information at finer granularity for policymakers^{16,20,21,22}. Policymakers can maximize resource allocation efficiency and react promptly in the areas that require urgent attention. Therefore, we aim to build a model which can make short-term predictions for the number of deaths at the county level. We can also produce predictions for the number of deaths at the state level by a simple aggregation of our county-level predictions. Such a model could help both state and local governments make informed decisions based on the predictions at the corresponding state and county levels.
The short-term prediction of COVID-19 dynamics is essentially a classical time series modeling problem. For each day, we collect COVID-19 related data as the input, and the desired output is the predicted number of deaths for the next 2 weeks. On the next day, we obtain the new data from one additional day and update our predictions for the next 2 weeks starting from the new date. This paper develops a fully data-driven approach to model COVID-19 dynamics. Specifically, we build a self-attention deep learning model, which takes input data from multiple sources and predicts the number of COVID-19-related deaths for the next 2 weeks at the county level in the United States. The state-level prediction is then obtained by summing up all county-level predictions of the corresponding state. Moreover, we propose a sequence mixup augmentation approach to further improve the training of our transformer model^{23}. We carefully evaluate our proposed methods across different training periods. We further compare our best model with other benchmark models to show its strength and usability.
Unlike most existing deep learning approaches based on recurrent neural networks (RNN) such as the LSTM-RNN^{24}, our proposed approach is based on a current state-of-the-art self-attention model in Natural Language Processing, also known as the transformer model, which is able to capture both short-term and long-term dependencies within the time series and enjoys computational efficiency. In particular, the basic building blocks of the transformer model are the self-attention modules, which directly model dependencies on previous time steps by assigning attention scores. A large score between two events implies a strong dependency, while a small score implies a weak one. In this way, the modules are able to adaptively select time steps that are at any temporal distance from the current time step. Therefore, the transformer model has the ability to capture both short-term and long-term dependencies. Moreover, the non-recurrent structure of the transformer facilitates efficient training of multi-layer models. Practical transformer models can be as deep as dozens of layers, where deeper layers capture higher-order dependencies. The ability to capture such dependencies creates models that are more powerful than RNNs, which are often shallow. Also, transformer models allow full parallelism when calculating dependencies across all time steps, i.e., the computation between any two time steps is independent of each other. This yields a model with strong computational efficiency.
Related work
Epidemic prediction is a time series prediction problem. Given a sequence of time points with corresponding data, a model needs to predict the target incidence at future times. There are four main classes of predictive models for epidemic prediction: compartmental models, simulation modeling, statistical models, and deep learning models. Moreover, within the same class of models, the final prediction may be an ensemble of several other models. In fact, on the CDC website, the final CDC prediction is obtained by ensembling all the submitted models^{18}.

Compartmental models are one of the most widely used model types for modeling epidemic diseases. They characterize the disease spread dynamics using systems of ordinary differential equations. One of the most successful compartmental models is the SIR model, which is used to predict disease progression within the population in one area. In the SIR^{25,26} model, the population is assigned to the Susceptible (S), Infectious (I), or Recovered (R) compartment. One variant of the SIR model is the SEIR model^{27,28,29,30}, which additionally introduces an Exposed (E) compartment. Other variants of the SIR model include the SIRD^{31} model, with an additional Deceased (D) compartment. Transitions from one compartment to another (i.e., the disease spreading dynamics) are modeled as differential equations, often in the form of a transition matrix. Compartmental models are hard to apply widely due to the difficulty of determining the hyperparameters in every differential equation used^{32}. One highly successful compartmental model for COVID-19 predictions is from Karlen's group^{33}. Karlen's model uses discrete-time differences instead of ordinary differential equations to model the transition matrix.

Simulation modeling uses computer simulation to model different components in the studied environment and observe their interactions. Cellular automata and agent-based simulation are two simulation modeling techniques used to model complex systems^{34}. In COVID-19 prediction, several groups^{13,14} use agent-based simulation due to its flexibility in simulating the dynamic behaviours of systems with a large number of entities. While highly flexible, agent-based simulation requires access to extensive computational resources. Besides, multiple simulations are needed for statistically sound observations, resulting in longer inference time. This limitation is noticeable when the simulated systems involve a large number of individual entities.

Conventional statistical models use regression methods to fit the data directly. Such models include ARIMA, Gaussian process regression, and linear regression, which are more flexible than compartmental models. However, most of the time, the usage of statistical models is limited by the need for sophisticated handcrafted features, which often requires knowledge from domain experts. For example, the CLEP model^{11} uses an ensemble of an exponential predictor and a linear predictor. The model from Zhu et al.^{21} uses spatio-temporal information among counties to make predictions. The model from Lampos et al.^{12} applies statistical models (an autoregressive model and Gaussian process regression) to predict the disease's dynamics across countries in Europe and the United Kingdom based on online search data.

Deep learning models are deep neural networks that learn directly from input data. Such models are more flexible than both compartmental models and conventional statistical models. Due to their representation capability, such models need less sophisticated handcrafted preprocessing of the input data. In time series prediction problems, some common deep learning models include Long short-term memory (LSTM)^{24}, the Gated Recurrent Unit (GRU)^{35}, and the transformer^{36,37}. All the above models can capture intrinsic information from sequential data for accurate prediction. One limitation of such models is the need for large amounts of training data. Concurrently with our work, there are other deep learning models, including those from Refs.^{16,17}, that utilize the attention mechanism from the transformer architecture.
Different models have their merits for different date ranges. During the beginning of the COVID-19 outbreak, due to the limitations of the available data, predictions came mostly from compartmental or statistical models. With more data available, deep learning models are showing their advantages in model flexibility and prediction accuracy. It is also more challenging for a model to make accurate predictions as the prediction granularity becomes finer. Intuitively, errors are more likely to accumulate when a model must predict many targets rather than only a few. However, finer prediction granularity, such as prediction at the county level, is highly desired since local predictions help policymakers make tailored policies for individual counties.
Our contribution
In this paper, our goal is to predict the weekly total number of deaths at both the county level and the state level for the next 2 weeks, given the current week's data. Each single-day datum includes the number of confirmed cases, the number of deaths, community mobility, and population. Our novelty is to connect established Natural Language Processing techniques with COVID-19 time series prediction. In particular, we build a self-attention model, also known as the transformer model in Natural Language Processing, that is able to capture both the short-term and long-term dependencies within the time series input data. Our model is built upon two main ideas. First, by aggregating accurate predictions from the county level to the state level, our model shows strong performance on the state-level prediction task in all prediction periods. Second, we further improve our model with a data augmentation method: we implement mixup^{23} at the input layer. To the best of our knowledge, this is the first application of mixup data augmentation to COVID-19 data. Data augmentation further improves our model's performance and is particularly helpful when new trends emerge. Using these two core ideas, our proposed method, named COURAGE (COUnty aggRegation mixup AuGmEntation), is able to generate a short-term prediction of 2-week-ahead COVID-19 related deaths for each county and state in the United States. When compared with other benchmark models, COURAGE shows strong performance across different periods, demonstrating its strength and usability for predicting the COVID-19 related number of deaths.
Results
To evaluate our proposed method, we compare county-level and state-level predictions using both the county-level and state-level testing sets, with mean absolute error (MAE) as our comparison metric. For county-level predictions, we compare the predictions from our COURAGE model with those of its two member models, the County model and the Mixup model. We also compare our model with the baseline Naive model, which simply uses the previous week's reported total number of deaths as the prediction. In the state-level predictions, besides the previously mentioned models, we have an additional baseline State model, which is the COURAGE prediction generated solely from the state-level input data without any county-level information or the mixup data augmentation. We compare the predictions according to the training period used, corresponding to 0.5, 0.6, 0.7, and 0.8 of the total dataset. We also present the performance of each model across multiple non-overlapping periods. Finally, we compare our models with the available models contributed to the CDC forecast website.
Comparison among models
We summarize the comparison among our models in Table 1. We emphasize that our COURAGE model is an ensemble model that consists of two separate models: the County model and the Mixup model. These two can also serve as standalone models. We use the Naive model (a.k.a. the persistence forecast model, which uses the previous week's reported number of deaths as the forecast) as a baseline model in the county-level prediction task. We use both the Naive model and the State model (prediction using only state-level inputs) as baseline models for the state-level prediction task.
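For concreteness, the persistence (Naive) baseline described above can be sketched in a few lines. This is an illustrative implementation, not the authors' code; the function name and array layout are assumptions:

```python
import numpy as np

def naive_forecast(weekly_deaths):
    """Persistence forecast: repeat the most recent completed week's
    reported total number of deaths as the prediction for both Week 1
    and Week 2. `weekly_deaths` is a 1-D array of weekly totals ordered
    in time, with the last entry being the current week (Week 0)."""
    last = weekly_deaths[-1]
    return np.array([last, last])  # [Week 1 prediction, Week 2 prediction]
```

Despite its simplicity, this baseline is competitive when the death counts are locally stable, which is why it serves as the reference model in Table 1.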
The County model and the Mixup model produce more accurate predictions than the Naive model in the county-level prediction task. By combining these two models, we improve our COURAGE model's prediction accuracy. We also obtain accurate predictions from the County model and the Mixup model in the state-level prediction task. They outperform both the Naive model and the State model. When combining both models, COURAGE has the best accuracy under different prediction periods, except for the Week 2 predictions when using the training dataset dated from March 7, 2020, to December 1, 2020. We observe that the County model and the Mixup model have their respective strengths in different periods. Our COURAGE model often obtains the best of both models and produces the most accurate predictions.
Predictions for different periods
We show the number-of-deaths predictions for different periods in Table 2. Since we use different amounts of data points as our training set, our training data consist of information across different periods. We take the non-overlapping period from each testing set as a separate out-of-sample prediction period. In all the periods, the County and Mixup models produce better predictions than the Naive model. This result shows the effectiveness of both models. For the Mixup model, its strength is visible in the last two prediction periods. In the last prediction period, we use our trained model (using the training set with data from 2020-03-07 to 2020-12-01) to predict the latest data, with a prediction period from 2021-01-18 to 2021-03-14. Our model is able to predict accurately for the new data, as illustrated by the prediction plot for New York in Fig. 1 and additional plots for other major states (Illinois, California, Texas, Arizona) in the Supplementary Information. In Fig. 1, we mark with light grey lines the prediction date ranges corresponding to those of Table 2. In the first two periods, the number of deaths is relatively low and stable. There is an increase in the number of deaths in the third period. Finally, we can see a huge increase or decrease in the number of deaths in the last two prediction periods. In scenarios with relatively stable trends, the County model provides better predictions than the Naive model. With a more drastic change in the trend, a model trained with mixup data augmentation generalizes better, hence outperforming both the Naive and County models. When combining predictions from the County and Mixup models, the COURAGE model provides a balanced and more accurate prediction across different periods.
Comparison with benchmark models
We obtain all publicly available predictions from the CDC website for comparison^{9,15,18,33,38,39,40,41,42,43,44,45,46,47,48,49,50,51}. Different research groups or individuals contribute these predictions. The CDC compiles predictions from every model that submits weekly predictions and uploads them to the CDC website. Note that different prediction models have their strengths in different date ranges. Moreover, they outperform most of the simple methods. We compare our models' predictions with those submitted to the CDC official website. The comparison is done for predictions from November 11, 2020 to February 6, 2021. For each model, we calculate the MAE for the Week 1 and Week 2 predictions, and we calculate the average MAE. We summarize the results in Table 3. Then we rank each model using the average MAE obtained for both Week 1 and Week 2 predictions. Due to the extensively long list of models, we only show results from the top 10 models in both tables. From Table 3, we can observe that our County model produces competitive accuracy for the number-of-deaths prediction; it is a top 10 model in terms of prediction accuracy. When we augment our dataset, our Mixup model further improves the prediction accuracy. We also examine in which period mixup data augmentation contributes most to model accuracy. Mixup data augmentation improves our model when a new trend emerges and helps our model achieve better prediction in the last period, as illustrated in Table S1 of the Supplementary Information. By combining strengths from the County and Mixup models, our COURAGE provides a good balance across different prediction periods.
Discussion
From our results, we see that our COURAGE model gains its strengths from both member models, namely the County model and the Mixup model. The two core ideas associated with these member models are the aggregation of high-quality county predictions (County model) and data augmentation (Mixup model).
The aggregation of high-quality county predictions results in high-quality state predictions. We can see high-quality county-level predictions from the County and Mixup methods in Table 1. This result shows that our model and training are effective in extracting and utilizing the information from the input data. When we aggregate these accurate county-level predictions, our models can produce state-level predictions that outperform the baseline models. As an illustrative comparison, the State model (baseline model) uses only state-level data as input. Such input is limited and coarser-grained, making it harder for the State model to produce accurate predictions. As a result, the State model's performance pales when compared with our model, as well as the Naive model. One way to refine state-level prediction is through the use of county-level data. By training our models using the larger county-level dataset, we obtain accurate county-level predictions. When we sum these high-quality county-level predictions, we obtain accurate state-level predictions. We justify this intuition with the strong state-level results of our model in Table 1.
The data augmentation from the Mixup method also plays a significant role in improving the accuracy of our predictions. Due to the fast-evolving dynamics of COVID-19, most models that submit their predictions to the CDC have high accuracy only for a certain period. Existing trends are much easier to fit than emerging trends, and any change of an existing trend is hard to predict due to the scarcity of data available at the onset of the change. The fundamental reason for prediction deterioration is the lack of visibility of the new trend given the limited data available. If we could increase the amount of available data by incorporating new data, predictive models could be improved when a new trend emerges. However, data generation is hard for any new trend. One way to improve the model's prediction on emerging trends is to improve its generalization with augmented input data. Our Mixup model uses a well-proven data augmentation technique to create new input data. Such training helps our model achieve accurate predictions when unseen trends emerge.
By combining the strengths of each member model (the County model and the Mixup model), we obtain the COURAGE model. Our COURAGE model obtains its prediction by averaging the County prediction and the Mixup prediction. This simple ensemble method allows our COURAGE model to decrease the variance across the different models and achieve a balanced model.
While our COURAGE model shows strong results, one limitation of our current model is that the self-attention matrix from the encoder is not easily translated into an explainable pattern. Our model's predictions use information from the confirmed cases, deaths, population, and mobility data. Their interaction is encoded by the encoder, with self-attention as a key mechanism. Hence, it would be beneficial to the research community if we could see any important relationships among these inputs through the attention matrix. Besides, our model is a relatively concise model that leverages temporal data. In our future work, we plan to extend our work to include spatial information such as interactions among counties or major cities. We expect that the inclusion of such geographical information would further improve our model's predictions. Currently, our model produces 2-week-ahead predictions. We plan to extend predictions to a longer horizon of up to 4 weeks ahead, and also to generate a probabilistic forecast that explicitly accounts for forecast confidence.
Methods
In this section, we present the data sources and data processing used in this paper. We also present details of our transformer-based model, the mixup data augmentation technique, and the training procedure.
Data sources
We use three comprehensive datasets in this study, including confirmed cases, deaths, population, and community mobility from two sources. We only focus on states in the mainland of the United States and do not consider Hawaii, Alaska, and other unincorporated territories in this paper. We use data from 47 states and 3206 counties.
Confirmed cases and deaths of COVID-19 We use data from the JHU CSSE COVID-19 dataset^{6}. This publicly available dataset is curated from different sources. The data used are collected from January 22, 2020, to February 7, 2021. We use confirmed cases and deaths from the dataset for every county from the 47 targeted states.
Community mobility Mobility data have been shown to help the risk analysis of COVID-19 and enhance the predictions of COVID-19 related deaths and cases^{21,52,53,54,55,56}. In the current article, we use Google mobility data^{8} to incorporate mobility information into our model. These reports record a community's daily movement by county for different area categories such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential. Each day of the week has a baseline value, which represents the usual community mobility on that day; the baseline value is the median of the data from January 3, 2020, to February 6, 2020. Google mobility data report the movement pattern of the US community, with each day's data given as the percentage change from the baseline value. A negative value for a particular day indicates a decrease in time spent in the area relative to the baseline day, while a positive value indicates an increase in mobility relative to the baseline day.
Demographic information We use population information from the JHU CSSE COVID-19 dataset^{6}, which includes population data for each corresponding county.
Data preparation
We retrieve the features required for our model training, including the number of confirmed cases, the number of deaths, population, and mobility information, from our datasets. We consolidate all input data into state-level and county-level datasets. We also include smoothed (averaged over the past 7 days) confirmed cases and deaths as input features. We separate the total data into training and testing datasets for each corresponding level. To test our method over different periods, we try different amounts (50%, 60%, 70%, and 80%) of the total data as our training dataset, leaving the remaining data as our testing dataset. Since we use multiple features as inputs, we apply standardization to the inputs to accommodate differences in scale across inputs. We also use additional data from February 8, 2021, to March 14, 2021, to test the model trained using 80% of the data, in order to examine whether our trained model can continue to perform well on a new period without additional training.
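The two preprocessing steps mentioned above, 7-day trailing averages and per-feature standardization, can be sketched as follows. This is an illustrative NumPy sketch; the original pipeline's handling of the first six days is not specified, so the code below simply averages over the available history:

```python
import numpy as np

def smooth_7day(series):
    """Trailing 7-day average of a daily series; the first days
    (fewer than 7 observations) average over the available history."""
    out = np.empty(len(series), dtype=float)
    for t in range(len(series)):
        out[t] = series[max(0, t - 6):t + 1].mean()
    return out

def standardize(features):
    """Per-feature z-score standardization; `features` has shape
    (days, K). Constant features are left at zero to avoid division
    by zero."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / np.where(sigma == 0, 1.0, sigma)
```

Standardization statistics would, in practice, be computed on the training split only and reused on the testing split.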
Transformerbased model
The prediction of the COVID-19 related number of deaths given a sequence of inputs is a time series modeling problem. In a typical time series prediction setting, given a sequence of previous days' numbers of deaths, the aim is to predict the number of deaths for the next day. Our prediction problem differs from this typical setting. In the current article, our input consists of the current week's number of deaths, number of confirmed cases, smoothed (averaged over 7 days) number of confirmed cases, smoothed number of deaths, community mobility data, and population of the area. More importantly, instead of predicting the daily number of deaths, our model predicts the weekly total number of deaths for the next 2 weeks (Week 1 and Week 2), given the current week (Week 0) input data. We join the current week's input information together to form a data vector of dimension K. The input for our COVID-19 prediction problem can thus be viewed as a sequence of K-dimensional vectors. Suppose we are given a sequence \({\mathcal {S}} = \{k_j\}_{j=1}^L\) of L days' data, where each single-day data vector \(k_j \in {\mathbf {R}}^K\) occurs at time j.
The key ingredient of our transformer-based model is the self-attention module. Unlike RNNs, the attention mechanism does not have a recurrent structure. In order to incorporate temporal information into the inputs, we apply the original positional encoding method^{36} to our data vectors. Alternatively, we could also use other positional encoding methods such as relative position^{57} to provide the temporal information for each single-day data vector in our input sequence.
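For reference, the original sinusoidal positional encoding^{36} can be computed as below; this is a generic sketch in which the window length L and embedding size M are placeholders, not values from this paper:

```python
import numpy as np

def positional_encoding(L, M):
    """Sinusoidal positional encoding (Vaswani et al.): returns Z of
    shape (L, M), one temporal vector z_j per day j in a length-L
    input window. Even dimensions use sine, odd dimensions cosine."""
    pos = np.arange(L)[:, None]   # day index j
    i = np.arange(M)[None, :]     # embedding dimension index
    angle = pos / np.power(10000.0, (2 * (i // 2)) / M)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
```

Because each dimension oscillates at a different frequency, nearby days receive similar encodings while distant days remain distinguishable.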
Before passing to the attention module, we first transform our sequence of single-day data vectors using a matrix \({\mathbf {U}} \in {\mathbf {R}}^{M \times K}\). After the transformation, for any single-day data \(k_j\) and its corresponding time stamp j, the temporal vector \(z_j\) and the single-day data vector \({\mathbf {U}}k_j\) both reside in \({\mathbf {R}}^M\). Given a sequence of L days' data \({\mathcal {S}} = \{k_j\}^L_{j=1}\), we get

$$ {\mathbf {X}} = {\mathbf {U}}{\mathbf {E}} + {\mathbf {Z}}, $$

where \({\mathbf {E}} = [k_1,k_2,\dots ,k_L] \in {\mathbf {R}}^{K \times L}\) is the sequence of single-day data vectors and \({\mathbf {Z}} = [z_1, z_2,\dots , z_L] \in {\mathbf {R}}^{M \times L}\) is the concatenation of the temporal vectors.
We pass \({\mathbf {X}}\) through the self-attention module. Specifically, we compute the attention output \({\mathbf {S}}\) by

$$ {\mathbf {S}} = \text {Softmax}\left( \frac{{\mathbf {Q}}{\mathbf {K}}^T}{\sqrt{M_K}}\right) {\mathbf {V}}. $$
Here, \({\mathbf {Q}}\), \({\mathbf {K}}\), \({\mathbf {V}}\) are the query, key and value matrices obtained by different linear transformations of \({\mathbf {X}}\), and \({\mathbf {W}}^Q\), \({\mathbf {W}}^K \in {\mathbf {R}}^{M \times M_K}\), \({\mathbf {W}}^V \in {\mathbf {R}}^{M \times M_V}\) are weights for the respective linear transformations. In practice, multi-head self-attention increases model flexibility and is beneficial for data fitting. In multi-head self-attention, different sets of weights \(\{{\mathbf {W}}^Q_h,{\mathbf {W}}^K_h,{\mathbf {W}}^V_h\}^H_{h=1}\) are used to compute different attention outputs \({\mathbf {S}}_1,{\mathbf {S}}_2,\dots ,{\mathbf {S}}_H\). The final attention output for the sequence is then obtained by concatenating all the attention outputs and passing through a final linear transformation,

$$ {\mathbf {S}} = [{\mathbf {S}}_1, {\mathbf {S}}_2, \dots , {\mathbf {S}}_H]\, {\mathbf {W}}^O, $$

where \({\mathbf {W}}^O \in {\mathbf {R}}^{HM_V \times M}\) is an aggregation matrix.
We highlight that the self-attention mechanism allows the selection of any single-day data whose occurrence time is at any distance from the current time. The jth column of the attention score matrix Softmax \((\mathbf {QK}^T/\sqrt{M_K})\) indicates the extent of dependency of the jth single-day data point (\(k_j\)) on its history. Hence, the attention mechanism allows the capture of short- and long-term dependencies in the sequence data. On the other hand, RNN-based models encode the data's history sequentially via hidden representations of events, where the state at j depends on that at \(j-1\), which in turn depends on \(j-2\), etc. If the RNN fails to learn sufficient information for the single-day data at j, the subsequent hidden representation of any other single-day data at t where \(t \ge j\) will be adversely impacted.
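A single-head version of the scaled dot-product self-attention described above can be sketched in NumPy. This is an illustrative sketch, not the authors' code; here rows of X index days (a layout assumption for readability), and the weight shapes follow the notation in the text:

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, WQ, WK, WV):
    """Single-head scaled dot-product self-attention.
    X: (L, M) day embeddings; WQ, WK: (M, M_K); WV: (M, M_V).
    Returns the attention output of shape (L, M_V)."""
    Q, K, V = X @ WQ, X @ WK, X @ WV
    scores = softmax(Q @ K.T / np.sqrt(WK.shape[1]))  # (L, L), rows sum to 1
    return scores @ V
```

Each row of `scores` is the attention distribution of one day over the whole window, which is what lets the model attend to events at any temporal distance.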
The attention output \({\mathbf {S}}\) is then fed through a position-wise feed-forward neural network, generating a hidden representation \({\mathbf {h}}(j)\) of the input data sequence:

$$ {\mathbf {H}} = \max \left( 0,\, {\mathbf {S}}{\mathbf {W}}_1^{FF} + {\mathbf {b}}_1\right) {\mathbf {W}}_2^{FF} + {\mathbf {b}}_2. $$

Here \({\mathbf {W}}_1^{FF} \in {\mathbf {R}}^{M \times M_H}\), \({\mathbf {W}}_2^{FF} \in {\mathbf {R}}^{M_H \times M}\), \({\mathbf {b}}_1 \in {\mathbf {R}}^{M_H}\), \({\mathbf {b}}_2 \in {\mathbf {R}}^{M}\) are the corresponding weights and biases of the feed-forward neural network. The resulting matrix \({\mathbf {H}} \in {\mathbf {R}}^{L \times M}\) contains hidden representations of all the information in the input sequence, where each row corresponds to a particular day. We use this final representation as an input to our linear decoder layer and obtain our predictions of the weekly total number of deaths for the next 2 weeks.
In a typical time series prediction setting, the model forecasts only the next day's number of deaths given the current week's data. In that setting, we would need to implement masking in the attention mechanism to prevent the model from “peeking into the future”: the mask allows the jth data to attend only to the tth data where \(t \le j\). In the current article, our model predicts the weekly total number of deaths for the next 2 weeks (Week 1 and Week 2) given the current week (Week 0) input data. This setting frees us from the masking requirement, since the model never receives the future totals as part of the current week's input.
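For completeness, the causal masking that a standard autoregressive setting would require (and that our two-week-ahead setup avoids) can be illustrated in NumPy: disallowed future positions get an additive \(-\infty\) score, which softmax maps to zero weight.

```python
import numpy as np

def causal_mask(L):
    # Entry (j, t) is 0 when t <= j (past/present, allowed)
    # and -inf when t > j (future, blocked).
    mask = np.full((L, L), -np.inf)
    mask[np.tril_indices(L)] = 0.0
    return mask

scores = np.random.default_rng(2).normal(size=(5, 5))
masked = scores + causal_mask(5)

# Softmax over each row: future positions receive exactly zero weight.
w = np.exp(masked - masked.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)
print(np.allclose(np.triu(w, k=1), 0.0))  # True: no attention to the future
```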
A transformer-based model allows us to stack multiple self-attention modules together, with inputs passed through these modules sequentially. In this way, our model is able to capture higher-level dependencies. We remark that stacked RNNs/LSTMs are susceptible to exploding and vanishing gradients, rendering the stacked model more difficult to train. Figure 2 illustrates the architecture of the transformer-based model used in this project.
Mixup data augmentation
Data augmentation is a commonly used technique to improve the generalization of deep learning models. One recent method is mixup^{23,58,59} data augmentation. Given \({\mathbf {X}}\) as the input space of the training data and \({\mathbf {Y}}\) as the corresponding output space, each training example is a pair \((x_i, y_i)\). Mixup constructs new data by linear interpolation between existing examples:
$$\tilde{x} = \lambda x_i + (1-\lambda ) x_j, \qquad \tilde{y} = \lambda y_i + (1-\lambda ) y_j,$$
where \((x_i, y_i)\) and \((x_j, y_j)\) are two examples drawn at random and \(\lambda \in [0, 1]\). In our model, we apply mixup data augmentation at the input layer.
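The interpolation is a one-liner in practice. In the sketch below, \(\lambda\) is drawn from a Beta\((\alpha, \alpha)\) distribution, which is the common choice from the original mixup paper^{23}; the text above only requires \(\lambda \in [0, 1]\), so treat the sampling scheme and the `alpha` value as assumptions:

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=None):
    # lambda ~ Beta(alpha, alpha), as in Zhang et al. (2018);
    # alpha is a hyperparameter controlling interpolation strength.
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x_i + (1 - lam) * x_j, lam * y_i + (1 - lam) * y_j

x1, y1 = np.array([1.0, 0.0]), 10.0
x2, y2 = np.array([0.0, 1.0]), 20.0
x_new, y_new = mixup(x1, y1, x2, y2, rng=np.random.default_rng(3))
# The synthetic pair lies on the segment between the two originals,
# so y_new is always between 10.0 and 20.0 here.
```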
Training objective
We train the transformer model using the Huber loss function. Specifically, the training objective is defined as
$$\min \; \frac{1}{n}\sum _{i=1}^{n} \ell _{\delta }\big (y_i, {\hat{y}}_i\big ), \qquad \ell _{\delta }(y, {\hat{y}}) = {\left\{ \begin{array}{ll} \frac{1}{2}(y - {\hat{y}})^2 &{} \text {if } |y - {\hat{y}}| \le \delta , \\ \delta \left( |y - {\hat{y}}| - \frac{1}{2}\delta \right) &{} \text {otherwise,} \end{array}\right. }$$
where \({\mathbf {X}}\) and \({\mathbf {Y}}\) are the input space and the target space, \((x_i, y_i)\), \(i = 1, \dots , n\), are the training samples with model predictions \({\hat{y}}_i\), and \(\delta\) is a tuning hyperparameter.
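The Huber loss is quadratic for residuals up to \(\delta\) and linear beyond it, which makes it less sensitive to outliers than squared error. A minimal NumPy sketch of the objective (illustrative, not the authors' training code):

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for |residual| <= delta, linear beyond; averaged over samples.
    r = np.abs(y_true - y_pred)
    quad = 0.5 * r**2
    lin = delta * r - 0.5 * delta**2
    return np.where(r <= delta, quad, lin).mean()

y = np.array([0.0, 0.0, 0.0])
print(huber(y, np.array([0.5, 0.5, 0.5])))  # 0.125: residual 0.5, quadratic zone
print(huber(y, np.array([3.0, 3.0, 3.0])))  # 2.5:   residual 3.0, linear zone
```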
Training details
We use the transformer-based model (refer to the “Transformer-based model” section) for predicting both county-level and state-level numbers of deaths. Specifically, we use a transformer encoder with model dimension 32, 1 encoder layer with 8 attention heads, and feed-forward dimension 64. The output of the encoder layer connects to a single linear decoder layer that predicts the weekly total number of deaths for the next 2 weeks (Week 1 and Week 2) from the current week's input data. We use the Adam^{60} optimizer for its strong empirical performance in training neural networks, with an initial learning rate of 0.001. We halve the learning rate every 100 epochs and train for a total of 500 epochs. Figure 3 illustrates our training process.
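The step-decay learning-rate schedule described above (initial rate 0.001, halved every 100 epochs, 500 epochs total) can be written out explicitly; this is a plain-Python sketch of the schedule only, not the full training loop:

```python
def learning_rate(epoch, lr0=1e-3, decay_every=100, factor=0.5):
    # Step decay: the rate is halved once per 100-epoch block.
    # Epochs 0-99: 1e-3, 100-199: 5e-4, ..., 400-499: 6.25e-5.
    return lr0 * factor ** (epoch // decay_every)

schedule = [learning_rate(e) for e in range(500)]
print(learning_rate(0), learning_rate(100), learning_rate(499))
# 0.001 0.0005 6.25e-05
```

In a PyTorch-style setup this corresponds to a step scheduler with step size 100 and decay factor 0.5 wrapped around Adam.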
We use both the state-level and county-level training sets to train our models, with the smoothed number of deaths as the prediction target. After training, each model predicts the weekly total number of deaths for the next 2 weeks (Week 1 and Week 2) from the county-level dataset, and we sum the county predictions within each state to obtain that state's prediction. We denote the model trained this way as the County model. The Mixup model is trained identically, except that mixup data augmentation is applied to the input layer during training. Our COURAGE model is an ensemble of these two member models: it takes the average of the County model's and the Mixup model's predictions as its own. Our baseline is a Naive model that uses the current week's total number of deaths as the prediction for both Week 1 and Week 2. For the state-level prediction task, we include another baseline, a State model, which uses only state-level data as input when making predictions.
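The ensembling and county-to-state aggregation just described reduce to an average followed by a within-state sum. A small sketch with hypothetical county names and made-up numbers (the values and keys are purely illustrative):

```python
import numpy as np

# Hypothetical 2-week-ahead death predictions per county, from each member model.
county_pred = {
    ("GA", "Fulton"): {"county": np.array([12.0, 14.0]), "mixup": np.array([10.0, 16.0])},
    ("GA", "DeKalb"): {"county": np.array([5.0, 6.0]),   "mixup": np.array([7.0, 4.0])},
}

# COURAGE: average the County and Mixup member predictions for each county.
courage = {k: (v["county"] + v["mixup"]) / 2 for k, v in county_pred.items()}

# State-level prediction: sum the county-level predictions within the state.
ga_pred = sum(v for (state, _), v in courage.items() if state == "GA")
print(ga_pred)  # [17. 20.]: Week 1 and Week 2 totals for the state
```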
Conclusion
In summary, this article presents COURAGE, a new model for county-level and state-level COVID-19 predictions in the United States. We train COURAGE on county-level data and obtain state-level predictions by aggregating high-quality county-level predictions. We improve the model using mixup augmentation and ensemble the predictions from the County and Mixup models as our final output. To the best of our knowledge, ours is the first model to use mixup data augmentation to improve the accuracy of COVID-19 death predictions. Our experiments show that this new application of mixup data augmentation improves the model's prediction accuracy when new trends occur. COURAGE is flexible in that each member model (the County model and the Mixup model) can be used as a standalone model to produce accurate predictions in different periods; ensembled together, they allow COURAGE to achieve accurate predictions across all periods. COVID-19 is a serious crisis affecting our daily life and economy, and accurately predicting disease dynamics is a challenging task, especially when new trends emerge. We hope that our new training method improves COVID-19 death predictions and provides insight for resource allocation and disease control planning.
References
CDC data tracking. https://covid.cdc.gov/coviddatatracker.
COVID-19 Economic Crisis. https://carsey.unh.edu/COVID19EconomicImpactByState.
Long-Term Effects of COVID-19. https://www.cdc.gov/coronavirus/2019ncov/longtermeffects.html.
Coronavirus in the U.S.: Latest Map and Case Count. https://www.nytimes.com/interactive/2020/us/coronavirususcases.html (Accessed 7 Apr 2021).
The New York Times. Coronavirus (Covid-19) Data in the United States (2021).
Dong, E., Du, H. & Gardner, L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect. Dis. 20, 533–534 (2020).
COVID Tracking Project. https://covidtracking.com/.
Google LLC. Google COVID-19 Community Mobility Reports. https://www.google.com/covid19/mobility/ (Accessed 16 Mar 2021).
COVID-19 Simulator. https://covid19sim.org/documents/policymethods/.
Interpretable sequence learning for COVID-19 forecasting. https://cloud.google.com/solutions/interpretablesequencelearningforcovid19forecasting.
Altieri, N. et al. Curating a COVID-19 data repository and forecasting county-level death counts in the United States. Harvard Data Sci. Rev. https://doi.org/10.1162/99608f92.1d4e0dae (2020).
Lampos, V. et al. Tracking COVID-19 using online search. npj Digit. Med. https://doi.org/10.1038/s4174602100384w (2021).
Kerr, C. C. et al. Covasim: An agent-based model of COVID-19 dynamics and interventions. medRxiv https://doi.org/10.1101/2020.05.10.20097469 (2020).
Germann, T. C. et al. Using an agent-based model to assess K-12 school reopenings under different COVID-19 spread scenarios—United States, school year 2020/21. medRxiv https://doi.org/10.1101/2020.10.09.20208876 (2020).
Rodríguez, A. et al. DeepCOVID: An operational deep learning-driven framework for explainable real-time COVID-19 forecasting. medRxiv https://doi.org/10.1101/2020.09.28.20203109 (2020).
Gao, J. et al. STAN: Spatio-temporal attention network for pandemic prediction using real-world evidence. J. Am. Med. Inform. Assoc. 28, 733–743. https://doi.org/10.1093/jamia/ocaa322 (2021).
Jin, X., Wang, Y.-X. & Yan, X. Inter-series attention model for COVID-19 forecasting. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), 495–503. https://doi.org/10.1137/1.9781611976700.56 (2021).
Ray, E. L. et al. Ensemble forecasts of coronavirus disease 2019 (COVID-19) in the U.S. medRxiv https://doi.org/10.1101/2020.08.19.20177493 (2020).
McKinley, J. & Ferré-Sadurní, L. New York Severely Undercounted Virus Deaths in Nursing Homes, Report Says. https://www.nytimes.com/2021/01/28/nyregion/nursinghomedeathscuomo.html (2021).
Li, D. et al. Identifying US counties with high cumulative COVID-19 burden and their characteristics. medRxiv https://doi.org/10.1101/2020.12.02.20234989 (2021).
Zhu, S. et al. High-resolution Spatio-temporal Model for County-level COVID-19 Activity in the U.S. arXiv:2009.07356 (2020).
Chande, A. et al. Real-time, interactive website for US county-level COVID-19 event risk assessment. Nat. Hum. Behav. 4, 1313–1319. https://doi.org/10.1038/s41562020010009 (2020).
Zhang, H., Cissé, M., Dauphin, Y. N. & Lopez-Paz, D. mixup: Beyond empirical risk minimization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30–May 3, 2018, Conference Track Proceedings (OpenReview.net, 2018).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Harko, T., Lobo, F. S. & Mak, M. Exact analytical solutions of the Susceptible-Infected-Recovered (SIR) epidemic model and of the SIR model with equal death and birth rates. Appl. Math. Comput. 236, 184–194. https://doi.org/10.1016/j.amc.2014.03.030 (2014).
Chen, Y.-C., Lu, P.-E., Chang, C.-S. & Liu, T.-H. A time-dependent SIR model for COVID-19 with undetectable infected persons. IEEE Trans. Netw. Sci. Eng. 7, 3279–3294. https://doi.org/10.1109/tnse.2020.3024723 (2020).
Hethcote, H. W. The mathematics of infectious diseases. SIAM Rev. 42, 599–653 (2000).
Yang, Z. et al. Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions. J. Thorac. Dis. 12, 165 (2020).
Xu, C., Yu, Y., Chen, Y. & Lu, Z. Forecast analysis of the epidemics trend of COVID-19 in the USA by a generalized fractional-order SEIR model. Nonlinear Dyn. 101, 1621–1634. https://doi.org/10.1007/s11071020059463 (2020).
Guo, L., Zhao, Y. & Chen, Y. Management strategies and prediction of COVID-19 by a fractional order generalized SEIR model. medRxiv https://doi.org/10.1101/2020.06.18.20134916 (2020).
Caccavo, D. Chinese and Italian COVID-19 outbreaks can be correctly described by a modified SIRD model. medRxiv https://doi.org/10.1101/2020.03.19.20039388 (2020).
Baek, J. et al. The Limits to Learning a Diffusion Model. arXiv:2006.06373 (2021).
Karlen, D. Characterizing the spread of CoViD-19. arXiv:2007.07156 (2020).
Sayama, H. Introduction to the Modeling and Analysis of Complex Systems (Open SUNY Textbooks, 2015).
Cho, K. et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–1734. https://doi.org/10.3115/v1/D141179 (Association for Computational Linguistics, 2014).
Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30 (eds Guyon, I. et al.) (Curran Associates, Inc., 2017).
Zuo, S., Jiang, H., Li, Z., Zhao, T. & Zha, H. Transformer Hawkes process. In Proceedings of the 37th International Conference on Machine Learning, Vol. 119 of Proceedings of Machine Learning Research (eds Daumé III, H. & Singh, A.) 11692–11702 (PMLR, 2020).
COVID-19 Modeling. https://bobpagano.com.
Microsoft. https://www.microsoft.com/enus/ai/aiforhealth.
Oliver Wyman Pandemic Navigator. https://pandemicnavigator.oliverwyman.com/.
CMU Delphi Group. https://delphi.cmu.edu/.
Los Alamos National Laboratory. https://covid19.bsvgateway.org/.
University of Massachusetts–Mechanistic Bayesian model. https://github.com/dsheldon/covid.
Wang, L. et al. Spatiotemporal Dynamics, Nowcasting and Forecasting of COVID-19 in the United States. arXiv:2004.14103 (2020).
MOBS Lab Analysis of the COVID-19 Epidemic. https://www.mobslab.org/2019ncov.html.
Srivastava, A., Xu, T. & Prasanna, V. K. Fast and Accurate Forecasting of COVID-19 Deaths Using the SIkJ\(\alpha\) Model. arXiv:2007.05180 (2020).
Lega, J. Parameter estimation from ICC curves. J. Biol. Dyn. 15, 195–212 (2021).
Wu, D. et al. DeepGLEAM: A hybrid mechanistic and deep learning model for COVID-19 forecasting. CoRR arXiv:2102.06684 (2021).
UGA CEID. https://github.com/CEIDatUGA/COVIDstochasticfitting.
London School of Hygiene and Tropical Medicine. https://www.lshtm.ac.uk/research/centres/centremathematicalmodellinginfectiousdiseases/covid19.
Steve McConnell Covid Complete. https://stevemcconnell.com/covidcomplete/.
Zachreson, C. et al. Risk mapping for COVID-19 outbreaks in Australia using mobility data. J. R. Soc. Interface 18, 20200657 (2021).
James, N. & Menzies, M. Efficiency of communities and financial markets during the 2020 pandemic. arXiv:2104.02318 (2021).
Chicchi, L., Giambagli, L., Buffoni, L. & Fanelli, D. Mobility-based prediction of SARS-CoV-2 spreading. arXiv:2102.08253 (2021).
Gösgens, M. et al. Trade-offs between mobility restrictions and transmission of SARS-CoV-2. J. R. Soc. Interface 18, 20200936 (2021).
Carroll, C. et al. Time dynamics of COVID-19. Sci. Rep. 10, 21040. https://doi.org/10.1038/s41598020777094 (2020).
Shaw, P., Uszkoreit, J. & Vaswani, A. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 2 (Short Papers), 464–468. https://doi.org/10.18653/v1/N182074 (Association for Computational Linguistics, 2018).
Guo, H., Mao, Y. & Zhang, R. Augmenting data with mixup for sentence classification: An empirical study. CoRR arXiv:1905.08941 (2019).
Verma, V. et al. Manifold mixup: Better representations by interpolating hidden states. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97 of Proceedings of Machine Learning Research (eds Chaudhuri, K. & Salakhutdinov, R.) 6438–6447 (PMLR, 2019).
Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings (eds Bengio, Y. & LeCun, Y.) (2015).
Author information
Authors and Affiliations
Contributions
S.E., S.Y., and T.Z. designed the research; S.E., S.Y., and T.Z. performed the research; S.E. analyzed data; and S.E., S.Y., and T.Z. wrote the paper. S.Y. and T.Z. contributed equally to this work.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Er, S., Yang, S. & Zhao, T. COUnty aggRegation mixup AuGmEntation (COURAGE) COVID-19 prediction. Sci. Rep. 11, 14262 (2021). https://doi.org/10.1038/s41598021935456
Received:
Accepted:
Published:
This article is cited by
An epidemiological modeling framework to inform institutional-level response to infectious disease outbreaks: a Covid-19 case study. Scientific Reports (2024)
COVID-19 forecasts using Internet search information in the United States. Scientific Reports (2022)
Iterative data-driven forecasting of the transmission and management of SARS-CoV-2/COVID-19 using social interventions at the county-level. Scientific Reports (2022)