Development of forecast models for COVID-19 hospital admissions using anonymized and aggregated mobile network data

Taghia, Jalil; Kulyk, Valentin; Ickin, Selim; Folkesson, Mats; Nyström, Cecilia; Ågren, Kristofer; Brezicka, Thomas; Vingare, Tore; Karlsson, Julia; Fritzell, Ingrid; Harlid, Ralph; Palaszewski, Bo; Kjellberg, Magnus; Gustafsson, Jörgen

doi:10.1038/s41598-022-22350-6

Article
Open access
Published: 22 October 2022

Jalil Taghia¹,
Valentin Kulyk¹,
Selim Ickin¹,
Mats Folkesson¹,
Cecilia Nyström²,
Kristofer Ågren³,
Thomas Brezicka⁴,
Tore Vingare⁵,
Julia Karlsson⁵,
Ingrid Fritzell⁵,
Ralph Harlid⁶,
Bo Palaszewski⁷,
Magnus Kjellberg⁸ &
Jörgen Gustafsson¹

Scientific Reports volume 12, Article number: 17726 (2022)

Abstract

Reliable forecasts of COVID-19 hospital admissions over near-term horizons can enable effective resource management, which is vital for reducing pressure on healthcare services. Mobile network data have attracted attention during the COVID-19 pandemic because of their ability to capture people's social behavior. Crucially, we show that there are latent features in irreversibly anonymized and aggregated mobile network data that carry useful information about the spread of the SARS-CoV-2 virus. We describe the development of forecast models that use such features to predict COVID-19 hospital admissions over near-term horizons (21 days). In a case study, we verified the approach for two hospitals in Sweden, Sahlgrenska University Hospital and Södra Älvsborgs Hospital, working closely with the experts engaged in hospital resource planning. Importantly, the results of the forecast models were used in 2021 by logisticians at the hospitals as one of the main inputs for their decisions regarding resource management.

Introduction

COVID-19 outbreaks have exhausted healthcare systems around the world. The concentration of admitted patients during outbreaks, combined with limited hospital resources, puts pressure on healthcare systems. Knowing the estimated number of admitted (hospitalized) patients over near-term horizons of two to three weeks can significantly facilitate resource management and planning. Forecasts of the number of admitted patients can serve as an important input for predicting hospital resource allocation. However, developing forecast models for admitted COVID-19 patients has proven challenging¹,²,³,⁴,⁵ due to, among other things, the lack of historical data, the involvement of many external factors, and most notably the evolving nature of COVID-19 outbreaks, including the evolution of SARS-CoV-2⁶, the dynamic nature of people's behavioral response to external factors such as regulations set by authorities⁷, the increasing number of people with antibodies, and the evolution of antibody immunity to SARS-CoV-2⁸.

Considering only historical data on the number of COVID-19 hospital admissions is not sufficient for predicting future admissions. It can result in forecast models that lack the novelty factor, in the sense that they fail to capture novel trends for which there is no precedent in the past. Effectively capturing trend changes supports proactive decision making, which is vital during outbreaks. Including external factors in forecast models might improve their ability to capture novel trends. However, this is not straightforward, as there are many external factors to consider and it is difficult to determine their importance⁹,¹⁰,¹¹,¹²,¹³,¹⁴. Examples of such external factors are various temporal seasonalities (e.g., weekly and monthly seasonalities), public holidays, events, weather forecasts, regulations set by authorities, and changes in people's behavior at different phases of the pandemic. With this in mind, we argue in favor of using mobile network data of user activities as one of the main inputs for constructing forecast models for COVID-19 hospital admissions over near-term horizons. We motivate the use of mobile activity data by their inherent ability to capture people's social behavior with respect to their physical movements in society.

Including the mobile activity data, in addition to the historical data on COVID-19 hospital admissions, enables us to construct forecast models that maintain their novelty factor by exploiting the approximate time lag between the point at which people first come into contact with the virus and the time at which they are hospitalized. Our underlying hypothesis is that user activities are positively correlated with the number of admitted patients, since higher activity means a concentration of more individuals in a limited area and, in turn, a higher risk of exposure to the SARS-CoV-2 virus.

During the COVID-19 pandemic, mobile network data of user activities have seen several applications, such as informing reopening strategies¹⁵,¹⁶, informing evidence-based policy making by authorities in an attempt to manage the spread of SARS-CoV-2¹⁷,¹⁸,¹⁹, early detection of COVID-19 outbreaks²⁰,²¹, and informing COVID-19 forecast models²².

In a case study, we use irreversibly anonymized and aggregated geographical grid-level hourly mobile network data of user activities in Västra Götaland county in Sweden provided by Swedish operator Telia Sverige AB, and develop forecast models for prediction of the number of admitted COVID-19 patients at Sahlgrenska University Hospital (SU) located in Gothenburg and Södra Älvsborgs Hospital (SÄS) located in Borås. We describe development of the forecast models and discuss how insights from the models were used in planning and prediction of healthcare demands and resources.

Results

Development of the forecast model

The development of forecast models for near-future prediction of the number of admitted COVID-19 patients using mobile network activity data is one of the main results of this study. Our forecast model pipeline is composed of three interconnected models: the grid selection model, the spatiotemporal model, and the predictive model. Figure 1 shows the key components of the forecast model. A detailed description of the mathematical formulation and algorithmic implementations is provided in Methods.

Three types of input data are provided to the forecast model: (i) historical data on the number of admitted COVID-19 patients aggregated daily per hospital, hereafter referred to as the COVID-19 admission data, (ii) external factors in the form of antibody and vaccination data, and (iii) privacy-preserving anonymized and hourly aggregated mobile activity data. The forecast model is fully data-driven: it takes the three types of input data and produces predictions of the number of admitted patients for the duration of the forecast window.

We constructed two forecast models, one for SU and one for SÄS, each providing daily predictions for a duration of 21 days. The two forecast models share the same underlying architecture but are optimized separately. We proceed by briefly introducing the three main components of the forecast model pipeline.

Grid selection model

Mobile network data contain timeseries of aggregated hourly activities per grid, with the number of grids on the order of thousands. The grids are spread out across 49 municipalities in the Västra Götaland region, shown in Fig. 11. While hourly mobile activity data from the grids carry useful information about user activities, not all grids are equally relevant to the behavioural aspects related to COVID-19. Thus, there was a need to select the most relevant grids.

In constructing the grid selection model, we opted for a data-driven approach so that the selected grids of interest can change dynamically over time, as user behaviors do throughout the pandemic. As shown in Fig. 1a, the model takes in both historical grid-level hourly mobile activity data and COVID-19 admission data. It then selects the clusters of grids that are most related to user activities in connection with COVID-19.

Grid selection was performed on a weekly basis, as planning at the hospitals was done weekly. Figures 2 and 3 illustrate the selected clusters of grids at selected analysis dates used for construction of the forecast models for SU and SÄS, respectively. Using tags taken from OpenStreetMap²³, one can identify the geographical objects that the selected clusters of grids represent for a given analysis date.

Spatiotemporal model

Hourly mobile activity data from selected clusters of grids contain useful spatial information about user activities. Additionally, these data are temporal in nature, and their dynamics are affected not only by short-to-long range seasonalities, such as hourly and weekly seasonalities, but also by various external factors, possibly including the evolution of antibody development and regulations set by authorities. This implies the need to capture temporal dynamics when modelling such data.

Our hypothesis was that there are certain temporal patterns hidden in the mobile activity data that are particularly useful for the analysis of COVID-19. Thus, the forecast model was equipped with a spatiotemporal model. As shown in Fig. 1b, the model takes as its inputs hourly mobile activity data from selected clusters of grids, historical COVID-19 admission data, and antibody data. It then constructs a spatiotemporal memory containing useful information about the short-to-long term dynamics in the data. Specifically, the spatiotemporal memory contains latent spatiotemporal patterns in mobile activity data that satisfy the following two conditions: (i) they are among the major spatiotemporal patterns in the data, and (ii) they are statistically correlated, either positively or negatively, with the number of admitted patients. The major spatiotemporal patterns are defined in Methods. Conceptually, the first condition ensures that only latent spatiotemporal patterns supported by sufficient data samples are used for the subsequent correlation analysis.

Figures 4 and 5 show the (Pearson) correlation between the positively correlated spatiotemporal patterns and the daily number of admitted patients at SU and SÄS, respectively.

Predictive model

The forecast model is equipped with a predictive model in the form of a regressor. As shown in Fig. 1c, the predictive model takes as its inputs all available historical frames of the spatiotemporal memories, historical vaccination data, and historical COVID-19 admission data. It then produces predictions of the number of admitted patients for the duration of the forecast window, 21 days.

Considerations in development of the forecast model

Validation of the forecast models

Validation of the forecast models for COVID-19 was challenging due to the dynamic nature of the pandemic. We took the following approach. For a given analysis date, we divided the available historical data into a train set and a validation set. The forecast model parameters were tuned based on the results on the validation set. We varied the size of the validation set from one week to six weeks to find the best setting for the parameters of the forecast model. The parameter setting that performed well on average across all validation sets was used for the final analysis and is referred to as the optimal parameter setting. Next, we trained the model on the entire historical data using the optimal parameter setting, which provided the final forecasts for the duration of the forecast window. This validation procedure was performed on a weekly basis for both the SU and SÄS forecast models.
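The validation procedure above can be illustrated with a minimal sketch. It assumes a toy moving-average forecaster as a stand-in for the actual pipeline, MAE as the validation score, and a single hypothetical tunable parameter (`window`); none of these choices are taken from the study, and the data are synthetic.

```python
def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def moving_average_forecast(history, horizon, window):
    """Toy stand-in model (assumption): repeat the mean of the last `window` days."""
    level = sum(history[-window:]) / window
    return [level] * horizon

def select_setting(series, candidate_windows, weeks=range(1, 7)):
    """Pick the setting with the lowest MAE averaged over validation sets
    of one to six weeks held out from the end of the series."""
    avg_error = {}
    for window in candidate_windows:
        errors = []
        for w in weeks:
            n_val = 7 * w
            train, val = series[:-n_val], series[-n_val:]
            errors.append(mae(val, moving_average_forecast(train, n_val, window)))
        avg_error[window] = sum(errors) / len(errors)
    best = min(avg_error, key=avg_error.get)
    return best, avg_error

# Synthetic daily admissions with a weekly cycle (illustrative only).
series = [10 + (t % 7) for t in range(120)]
best_window, avg_error = select_setting(series, candidate_windows=[3, 7, 14])

# Retrain on the entire history with the selected setting, then forecast 21 days.
final_forecast = moving_average_forecast(series, horizon=21, window=best_window)
```

Averaging the score over several validation lengths, rather than relying on a single split, reduces the chance that a setting is chosen only because it happens to fit one particular stretch of the pandemic.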

Evaluation of the forecast models

Evaluation of the forecast models was based on both visual inspection by healthcare subject matter experts and objective measures. The visual inspection of the forecasts was done to examine model performance in capturing important trends in the data. We found that relying primarily on objective measures for the evaluation of the forecast models, while useful, can sometimes be misleading. For example, a forecast model can fail to capture important trend changes while still achieving reasonable performance on the objective measures. Visual inspection of the predictions by healthcare experts was found to provide complementary insights.

Addressing the degeneracy problem of the forecast models

Training the forecast models involved minimizing a loss function between the true and predicted number of admitted patients. We found that the forecast models often fall into degenerate solutions. The degeneracy problem arises when a forecast model learns to “repeat the past” and by doing so achieves a misleadingly low loss. This may be explained by noting that the COVID-19 admission data can be seen mostly as stationary signals containing relatively long steady-state regions followed by rare, sudden increasing or decreasing trends. The main issue with a degenerate model is that it is inherently unable to predict novel trend changes, resulting in unreliable forecasts.

To reduce the degeneracy problem, we introduced a regularization term to the loss function of the forecast models. The regularization was designed to discourage forecasts that are similar to the past and to encourage uncovering novel trends. The addition of the regularization was key to reducing the degeneracy problem in our forecast models for SU and SÄS. Construction of the regularization is discussed in Methods.
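To make the idea concrete, the following sketch shows one hypothetical way a loss can penalize forecasts that merely copy the recent past. It is not the regularization used in the study (that construction is given in Methods); the exponential similarity penalty, the names `lam` and `scale`, and the numbers are all assumptions chosen for illustration.

```python
from math import exp

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def regularized_loss(y_true, y_pred, y_past, lam=1.0, scale=1.0):
    """MSE plus a penalty that is largest when y_pred copies y_past.

    The penalty exp(-mse(y_pred, y_past) / scale) equals 1 for an exact copy
    of the past window and decays towards 0 as the forecast departs from it.
    (Hypothetical form, not the authors' formulation.)
    """
    similarity = exp(-mse(y_pred, y_past) / scale)
    return mse(y_true, y_pred) + lam * similarity

# A steady past week, followed by a week that ends with a small uptick.
y_past = [10, 10, 10, 10, 10, 10, 10]
y_true = [10, 10, 10, 10, 10, 10, 12]

copy_forecast = list(y_past)                   # degenerate: repeat the past
trend_forecast = [10, 10, 10, 10, 11, 12, 13]  # anticipates a rise, overshoots

plain_copy = mse(y_true, copy_forecast)
plain_trend = mse(y_true, trend_forecast)
reg_copy = regularized_loss(y_true, copy_forecast, y_past)
reg_trend = regularized_loss(y_true, trend_forecast, y_past)
```

In this toy case the plain MSE favours the forecast that repeats the past, whereas the regularized loss favours the forecast that commits to a trend change, which is the qualitative behaviour the paragraph describes.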

Inclusion of the external factors in the forecast model

As stated earlier, our main hypothesis in using mobile network data is that user activities are positively correlated with the spread of the SARS-CoV-2 virus. However, as the antibody rate in the population increases, user activities captured by the mobile activity data become less correlated with the number of admitted patients. The basis for this assumption is that the majority of individuals with antibodies would likely develop mild symptoms that would not lead to hospitalization. To compensate for the effect of antibody development in reducing the predictive capability of the mobile activity data in relation to COVID-19, we considered antibody test data and vaccination data as the two external factors. Of the two, vaccination data were given higher importance by the forecast model. The antibody test data were included indirectly through the spatiotemporal model, while the vaccination data were included directly through the predictive model, as shown in Fig. 1. In Methods, we describe the exact mathematical formulation used for including the external factors in the forecast models.

Forecasts of the number of admitted patients at SU and SÄS

Figure 6a shows the predicted number of admitted patients at SU during the course of the pandemic from February 15, 2021, until June 23, 2021, provided by the 21-day forecast model. Forecasts were delivered as inputs to SU on a weekly basis at 17 analysis dates (deliverable dates). That is, the forecast models were run at various analysis dates, each time providing predictions for the next 21 days. Figure 6b,c show the prediction error per analysis date in terms of the mean-absolute-error (MAE) score and the percentage error (relative error) between the true and predicted number of admitted patients, averaged across the duration of the forecast window. For the purpose of hospital resource planning, the forecasts from the most recent models were used by logisticians at SU. The most recent model refers to the model built using the latest available data at the time. Figure 6d shows forecasts from the latest models. The evolution of the forecast models is highlighted with markers indicating the major changes to the forecast model.

Similarly, Fig. 7a shows the predicted number of admitted patients at SÄS from April 19, 2021, until July 4, 2021, provided by the 21-day forecast model. Forecasts were delivered as inputs to SÄS on a weekly basis at 14 analysis dates. Figure 7b,c show the prediction error per analysis date in terms of the MAE score and the percentage error between the true and predicted number of admitted patients, averaged across the duration of the forecast window. Figure 7d shows forecasts from the latest models provided by the 21-day forecast model, which were used by logisticians at SÄS for resource planning.

Alternatively, we can study the quality of the forecasts by partitioning the forecast window into three separate weeks. Figure 8 shows the averaged percentage error per partition for SU and SÄS.
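As an illustration of the error measures discussed above, the sketch below computes the MAE, a percentage error averaged over the forecast window, and the same percentage error for each of the three weeks. The exact definition of the percentage error is not spelled out in this section, so the form used here (mean of |true − predicted| / true, skipping days with zero admissions) is an assumption, and the numbers are made up.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted admissions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_percentage_error(y_true, y_pred):
    """Mean of |true - pred| / true in percent, skipping days with true == 0.
    (Assumed definition of the relative error.)"""
    terms = [abs(t - p) / t * 100 for t, p in zip(y_true, y_pred) if t != 0]
    return sum(terms) / len(terms) if terms else float("nan")

def weekly_percentage_errors(y_true, y_pred, days_per_week=7):
    """Percentage error for each consecutive week of the forecast window."""
    return [
        mean_percentage_error(y_true[i:i + days_per_week],
                              y_pred[i:i + days_per_week])
        for i in range(0, len(y_true), days_per_week)
    ]

# Synthetic 21-day window: the forecast drifts further off in later weeks.
y_true = [20] * 7 + [25] * 7 + [40] * 7
y_pred = [22] * 7 + [20] * 7 + [30] * 7

window_mae = mean_absolute_error(y_true, y_pred)
window_pct = mean_percentage_error(y_true, y_pred)
per_week_pct = weekly_percentage_errors(y_true, y_pred)
```

Splitting the window by week makes it visible when a forecast with an acceptable 21-day average is accurate in the first week but degrades towards the end of the horizon.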

Discussion

The goal of the project was to provide insights that help hospitals with resource management and planning during the COVID-19 pandemic. We hypothesized that privacy-preserving mobile network data of user activities, irreversibly anonymized and aggregated, reflect people's social activity in terms of their physical movement in society. To validate this hypothesis, we took a model-based approach, as we believed that the aspects of the mobile network data that are of interest for the analysis of COVID-19 are latent in the data. Thus, the idea was to extract the latent spatiotemporal patterns in mobile activity data that are most relevant to the analysis of COVID-19 admission data.

The first step in achieving this goal was to extract the spatial information by selecting the geographical grids of interest to COVID-19. We first considered a hypothesis-driven approach in which grids are selected based on their geographical locations. However, such an approach would not have taken into account the dynamic nature of people's behavioural response to the evolving pandemic situation. For example, restrictions set by authorities affect people's social behavior, and that in turn affects the activities registered by different grids differently. Therefore, we opted for a data-driven approach to grid selection. Figures 2 and 3 show that the selected clusters of grids, and what they represent, change dynamically throughout the pandemic. For example, the effect of seasonal change on people's behavior is captured in the selected grid clusters: in winter, the grids of interest are mostly concentrated around city center areas, while in summer, in addition to the city center areas, there are clusters of grids representing areas outside the city such as parks and cottage areas.

The spatially relevant grid clusters were in the form of timeseries, and it was important to also capture the temporal dynamics in the data that are relevant for the analysis of COVID-19 admission data. We modeled the timeseries using the spatiotemporal model by decomposing the data into a number of spatiotemporal patterns representing various temporal dynamics in the timeseries data. We then showed that there are indeed latent spatiotemporal patterns in mobile activity data that are statistically correlated with the number of admitted patients at the hospitals. Figures 4 and 5 show the correlation scores for the positively correlated spatiotemporal patterns throughout the pandemic for SU and SÄS, respectively. We observed that the correlation scores were considerably higher for SU than for SÄS. This could be explained partly by the fact that SU is a larger hospital and its catchment area includes municipalities with higher population densities than those for SÄS. Hence, user activities captured in the mobile activity data better reflect people's behavior in that area.

In the spatiotemporal modelling of the mobile activity data for extraction of the correlated spatiotemporal patterns, we considered various lags between the mobile activity data and the number of COVID-19 admitted patients. The lag duration was varied from 7 to 49 days with a step size of 7 days. For each lag, we computed the Pearson correlation between the spatiotemporal patterns extracted from the mobile activity data and the number of admitted patients. We found that higher correlation scores were achieved for longer lags, between 28 and 42 days, and that in most cases throughout the pandemic the highest correlation score was achieved for the 35-day lag. This is shown for SU in Fig. 4 and for SÄS in Fig. 5.
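The lag scan described above can be sketched as follows. The example assumes a single daily-aggregated pattern series and uses synthetic data in which admissions echo the pattern exactly 35 days later; the function names are illustrative and not taken from the study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / (sx * sy)

def lagged_correlations(pattern, admissions, lags=range(7, 50, 7)):
    """Correlate pattern[t] with admissions[t + lag] for each lag in days."""
    scores = {}
    for lag in lags:
        if lag >= len(pattern) - 1:
            continue  # not enough overlapping samples for this lag
        scores[lag] = pearson(pattern[:-lag], admissions[lag:])
    return scores

# Synthetic illustration: admissions echo the activity pattern 35 days later.
pattern = [((t * 37) % 101) / 100 for t in range(200)]
admissions = [0.0] * 35 + pattern[:-35]

scores = lagged_correlations(pattern, admissions)
best_lag = max(scores, key=scores.get)
```

On real data the correlation would be computed per extracted spatiotemporal pattern and per analysis date, and the scores across lags would be compared rather than a single maximum being taken at face value.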

Building on the predictive capabilities of the correlated spatiotemporal patterns, we built the predictive model of the number of admitted patients, which uses these patterns as one of the main input features in addition to the historical COVID-19 admission data. We found that the spatiotemporal patterns play a complementary role, which proved useful for construction of our 21-day forecast models. This is explained as follows. Historical data on the number of admitted COVID-19 patients are more predictive of the future number of admitted patients at shorter lags (smaller than 7 days), and they lose their predictive relevance as the lag increases beyond 14 days. Mobile activity data, in contrast, were shown to be most relevant at longer lags (28 to 42 days) and to have relatively limited predictive relevance at lags shorter than 7 days. Taking these two input features into account together helped the forecast models extract useful information for near-term (i.e., 21-day) prediction of the number of admitted COVID-19 patients.

In addition to the historical COVID-19 admission data and mobile activity data, the forecast models were enriched with inputs from external factors related to the development of antibodies in the population, namely antibody test data and vaccination data, when they became available. We deliberately did not include the effect of external factors that are implicitly captured in the mobile activity data, such as weather conditions, season, public transportation, and compliance with the regulations set by authorities. In the latter case, for example, the implicit assumption is that the mobile network data of user activities are a proxy for compliance with the regulations.

It is important to note that the forecast model pipeline was developed during the pandemic. As the pandemic evolved, we needed to make changes to the forecast model. We refer to this as the evolution of the forecast model pipeline. Major changes to the models are highlighted in Figs. 6d and 7d. The changes are mostly related to the introduction of external factors to the forecast models. Antibody test data were included in the forecast models for SU on 2021-02-25 and for SÄS on 2021-04-24. Later on, however, the free-of-charge offering of the test to inhabitants was discontinued. The cost associated with the test could have introduced biases into our forecast models, as the statistics on antibody rates may no longer have been representative of the whole population. Therefore, the decision was made not to use these data in subsequent analyses, effective from 2021-05-17 onward for both the SU and SÄS forecast models. Vaccination data were included in the forecast models of SU and SÄS on 2021-04-26. Initially, we did not have access to the age groups. From 2021-05-10, the effect of the age group of the vaccinated population was included in the models as such data became available to us. Prior to 2021-05-25, we used a linear effect when including the vaccination data. From 2021-05-25, we used a nonlinear effect, where the nonlinearity was learned from the vaccination experience in Israel²⁴, as described in Methods. In terms of methodology, the only major change to the forecast model pipeline was related to the grid selection model. From 2021-04-19, we changed the method of grid selection from distance correlation to periodograms, as described in Methods. Transforming the timeseries data to periodograms added reliability and the freedom to choose the seasonality-related frequencies in the data that reflect the latest state of the pandemic.
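For readers unfamiliar with periodograms, the sketch below shows what the transformation exposes for a single grid's hourly activity series: the power carried by each seasonality-related frequency. It uses a direct discrete Fourier transform on synthetic data with a daily and a weaker weekly cycle. How periodograms are compared in order to select grids is described in Methods and is not reproduced here; the series and function names are assumptions for illustration.

```python
import cmath
import math

def periodogram(series):
    """Power at each nonnegative DFT frequency of the mean-removed series."""
    n = len(series)
    mean = sum(series) / n
    x = [v - mean for v in series]
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(coeff) ** 2 / n)
    return power

def dominant_period(series):
    """Period (in samples) of the strongest nonzero-frequency component."""
    power = periodogram(series)
    k = max(range(1, len(power)), key=power.__getitem__)
    return len(series) / k

# Two weeks of synthetic hourly activity: strong daily cycle, weaker weekly cycle.
hours = range(14 * 24)
activity = [
    10 + 5 * math.sin(2 * math.pi * t / 24) + 1.5 * math.sin(2 * math.pi * t / 168)
    for t in hours
]

power = periodogram(activity)
period_hours = dominant_period(activity)  # strongest seasonality, in hours
```

Working in the frequency domain lets one focus on chosen seasonalities, for example the daily and weekly components at indices 14 and 2 of `power` in this two-week example, instead of comparing raw hourly values.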

Forecast models for SU and SÄS were run regularly on a weekly basis as deliverables to the hospitals. At each deliverable, the forecasts for the next 21 days were provided to the logisticians at the respective hospitals. Figures 6 and 7 summarize the results. Crucially, in 16 out of 17 deliverables to SU, the percentage error, averaged across the 21-day forecast window, was below 30% (Fig. 6c). In the case of SÄS, excluding the analysis dates where the total number of admitted patients was fewer than 15, the percentage error was below 30% in 8 out of 9 deliverables (Fig. 7c).

In developing the forecast models, we made several assumptions with respect to the input data, namely data from external factors (i.e., antibody test and vaccination data), COVID-19 admission data, and mobile network data. As discussed earlier, some of these assumptions were addressed as the forecast models evolved. For example, the effect of antibody development in the population was included in the forecast models through vaccination data when they became available. However, other assumptions remained throughout, and we believe that addressing them could have improved the quality of the predictions. In the following, we state a few important examples. The first category of assumptions concerns the mobile network data of user activities. In this study, we used data provided by the Swedish operator Telia Sverige AB. Telia has the largest market share in Sweden based on the number of mobile subscriptions, at about 34.6 percent. We assumed that the data from Telia are representative of the population. However, this assumption may have introduced bias into our analysis, especially considering the age groups of the subscriber base. Furthermore, the mobile network data do not include the activity of users that are connected solely to Wi-Fi. Finally, there are other sources of uncertainty in mobile network data stemming from missing values, privacy-preserving aggregation procedures, possible noise in data collection, and social or regulatory recommendations that advise users to refrain from using their phones. Aware of such limitations and inherent uncertainties in mobile network data, we showed that such data can provide useful insights about the global trends of user activities.

The next assumption concerns data from PCR testing, which were not used by the forecast models. We believe that including such data as an additional input could have helped the forecast models, particularly if the data had been available at an early stage and had been collected on a population basis from all individuals presenting with symptoms of COVID-19. In Västra Götaland county in Sweden, different vaccines were used, including Comirnaty from Pfizer, Spikevax from Moderna and Vaxzevria from AstraZeneca. However, we were not provided with exact information about the vaccine types. Thus, in using vaccination data in the forecast models, we assumed the same efficacy for all types. Information about the mutations and variants of the SARS-CoV-2 virus, and the impact of evolving mutations and variants on COVID-19 vaccines, was not considered in the models. Finally, in addition to mobile network data, including data from other sources such as wastewater could potentially have enriched the models. Authorities involved in hospital resource management and planning were regularly informed about the assumptions made in the development of the forecast models and their limitations.

At SU, forecasts of the number of admitted COVID-19 patients were primarily processed by logisticians as one of the key inputs for predicting the number of hospital beds needed for patients with COVID-19. This was done heuristically by adding the average number of days that COVID-19 patients are hospitalized to each hospitalization case provided by the forecast model. Logisticians could then calculate the number of beds needed to take care of these patients. In addition to the forecast of admitted COVID-19 patients, other inputs were used at SU for hospital resource management. These inputs varied over time because of changes in the behavior of the population. The input used for the longest time at SU was the increase and decrease in positive PCR tests from primary care (Vårdcentral) in Gothenburg, which is the area where the patients hospitalized at SU live. Other inputs used during the pandemic included, for example, the increase or decrease in the number of travelers with Västtrafik, the public transport company in Gothenburg, and calls to Vårdguiden 1177 reporting symptoms that are correlated with COVID-19 (Vårdguiden 1177 is a Swedish service providing healthcare by telephone and the central national infrastructure for Swedish healthcare online). SU also followed the virus content in sewage water as yet another factor in their considerations²⁵,²⁶. These different inputs were weighted together and then presented to the group in charge of hospital resource management. This group subsequently decided whether to open or close wards and beds dedicated to patients with COVID-19. At SÄS, the forecasts of the number of admitted patients were used together with other indicators, such as the number of positive PCR tests in the community and cluster outbreaks in parts of the region, to estimate whether the number of admitted patients would increase, decrease or remain stable over the next 14-21 days. This estimate was used to adjust the estimated number of beds available for COVID-19 patients.
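The bed heuristic described for SU can be illustrated with a small sketch. It assumes that every forecast admission occupies a bed for a fixed average length of stay and ignores patients already in hospital at the start of the window; both simplifications, the function name, and the numbers are assumptions made here, not details reported by the hospitals.

```python
def beds_from_admissions(daily_admissions, avg_stay_days):
    """Beds occupied per day if every admission stays avg_stay_days days.

    A patient admitted on day d is counted on days d, d+1, ..., d+avg_stay_days-1
    (truncated at the end of the forecast window).
    """
    beds = [0] * len(daily_admissions)
    for day, admitted in enumerate(daily_admissions):
        for occupied_day in range(day, min(day + avg_stay_days, len(beds))):
            beds[occupied_day] += admitted
    return beds

# Hypothetical one-week excerpt of a forecast and a 3-day average stay.
forecast = [3, 4, 5, 5, 4, 2, 1]
beds = beds_from_admissions(forecast, avg_stay_days=3)
peak_beds = max(beds)
```

For this made-up forecast the occupancy builds up to a peak of 14 beds on the fourth and fifth days, which is the kind of figure a planning group could weigh against the other indicators mentioned above.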

Collaboration between operative and academic departments has proved to be a key success factor in the presented study. At the hospital level, there was profound knowledge of how the disease itself influenced the need for both intensive care resources and ordinary care facilities. It was observed that a rather constant fraction of the admitted patients who were hospitalized needed intensive care (approximately 15%) and that a higher fraction of the beds was occupied by the same patients during the high waves of the pandemic (approximately 25%). This insight called for the ability to estimate the number of patients admitted over time, so that resources could always be correctly allocated to all patients who were in urgent need of hospital care. SU found it extremely important to be able to forecast the number of admitted patients continuously (on a weekly basis).

Collaboration with academia and industry was regarded as a necessity, as it provided new opportunities for developing models that later proved very useful. The collaboration was carried out with an open mind towards the skills that the different actors could provide in each area. This resulted in a dynamic evolution of the knowledge of how useful models could be produced. There were no economic or other constraints on the collaboration, such as a pre-designed overall research protocol, which allowed for the free thinking that we feel is of great importance for developing this kind of model. Due to the immense complexity of a pandemic, such constraints would have hindered rather than facilitated research that had to be performed at a reasonable pace in order to be operationally useful as the pandemic developed in its own unpredictable way. From the hospital perspective, we have learned a lot both from the way the collaboration was initiated and performed and about what kind of data we think would be of great use for forecasting the need for hospital resources in a future pandemic situation. Decision makers can draw important insights from this work for the early-stage formulation of sustainable strategies regarding recommendations to inhabitants on how to behave, how to proceed with shutdowns of municipal services, and how to provide testing for infection at a high level across the healthcare region.

This project has been a collaborative effort between the two hospitals (SU and SÄS) and the private companies (Ericsson and Telia). It was initiated as an effort to handle the difficult and alarming pandemic situation. The rapid project initiation and the positive project outcome show the importance of forming and maintaining active networks across industries and across both the private and public sectors. The project can be seen as an excellent example of how society can benefit from digitalization, for example, mobile phones, mobile networks and data-driven model development. An interesting aspect of the project was that the project outcomes could be used in a timely manner in operational planning. At first, this was through insights from visualization of the mobile network data. However, as the project progressed, more advanced outcomes (the forecast models) were generated and gradually put into practice. Throughout the project, close communication between the parties was prioritized and maintained at various stages, including problem formulation, interpretation of the results, and the limitations and proper usage of the forecast models.

Finally, the results would not have been possible without the close three-party collaboration, and certainly not in such a timely fashion that they were used while the pandemic was still ongoing.

Methods

Mobile activity data

Mobile activity is based on the signaling data between mobile phones and mobile cell towers. Data used in this study were provided by the Swedish operator Telia Sverige AB and contain hourly mobile activities in 49 municipalities located in Västra Götaland county in Sweden. Hourly activities are obtained from user equipments (UEs) and are aggregated at the grid level. A grid is defined as a geographical square-like area. The grid size (spatial resolution) is determined by the location of cell towers combined with the density of cell towers and the number of signals from mobile phones. The highest spatial resolution is $500\times 500$ (m) and the lowest resolution is $16\times 16$ (km). Figure 11 visualizes 3720 grids in 49 municipalities located in Västra Götaland county in Sweden. In this study, we concentrated on two parts of Västra Götaland county, namely Gothenburg (with a population of around 625000 inhabitants) and Borås (with a population of around 117000 inhabitants), where Sahlgrenska University Hospital and Södra Älvsborgs Hospital are located, respectively. Specifically, we limited the studied areas to the cities, including the inner city, suburbs, rural-urban areas and areas frequently visited by inhabitants as detected from the Swedish operator Telia’s travel matrix data¹⁷.

The raw data were made privacy-preserving through a procedure consisting of anonymization and aggregation, extrapolation, and spatial-temporal aggregation¹⁷. The anonymization uses, among other mechanisms, k-anonymity with k = 5; that is, at all steps during the process there must be at least five mobile devices, or otherwise the information is discarded automatically. Aggregation follows, and the numbers are then further adjusted to be representative of the full population. These estimated total numbers of user activities are obtained per grid on an hourly basis. An activity in this context is defined as a unique dwell within a grid of at least 20 minutes¹⁷.
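
The suppression rule described above can be illustrated with a minimal sketch. It is not the operator's actual pipeline, which is not detailed here. The event layout `(grid_id, hour, device_id)`, the function name `aggregate_hourly_activity`, and the `scale` factor standing in for the extrapolation to the full population are all assumptions made for illustration only.

```python
from collections import defaultdict

K_MIN_DEVICES = 5  # k-anonymity level stated in the text


def aggregate_hourly_activity(dwell_events, k=K_MIN_DEVICES, scale=1.0):
    """Count distinct devices per (grid, hour) and suppress small groups.

    dwell_events: iterable of (grid_id, hour, device_id) tuples
                  (hypothetical layout, assumed for this sketch).
    scale: hypothetical extrapolation factor; the real adjustment to the
           full population is not described in the text.
    """
    devices = defaultdict(set)
    for grid_id, hour, device_id in dwell_events:
        devices[(grid_id, hour)].add(device_id)

    return {
        cell: len(ids) * scale
        for cell, ids in devices.items()
        if len(ids) >= k  # fewer than k devices: the cell is discarded
    }


# Grid "A" has five devices in hour 0, grid "B" only four.
events = [("A", 0, f"dev{i}") for i in range(5)]
events += [("B", 0, f"dev{i}") for i in range(4)]
activity = aggregate_hourly_activity(events, scale=2.0)
```

In this toy input, only grid "A" survives, because grid "B" falls below the k = 5 limit and is dropped rather than reported as a small count.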

Vaccination data

Aggregated vaccination data were supplied by Region Västra Götaland. In all analyses, we considered only the effect of dose one.

Antibody data

From October 2020, inhabitants in Region Västra Götaland could receive an antibody test to verify whether they had antibodies for COVID-19. The test was initially free of charge, but the free-of-charge offering was discontinued in April 2021. The antibody test data were used by the forecast models only for the period when the test was free of charge.

COVID-19 admission data on the number of admitted patients

At SU, for a patient to be counted as an admitted COVID-19 patient, there should be a positive PCR test taken no earlier than 14 days before the hospitalization. Therefore, the number of admitted COVID-19 patients consists not only of patients confirmed with COVID-19 at the hospital when they seek care but also of patients who have tested positive for COVID-19 elsewhere, for example at a primary care center, before seeking care at SU.

At SÄS, data on the daily number of admitted COVID-19 patients were provided by the SÄS patient registering system. Patients who came to the emergency ward and were categorized as “pandemic cause” were tested and, if found positive, were registered as admitted for COVID-19. The data are not public.

Preprocessing of mobile activity data

The mobile activity data were collected from geographical grids with dynamic sizes¹⁷. It was decided to normalize the data by the respective grid area. This changed the data units from raw mobile activity counts to mobile activity per square meter.
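
As a minimal sketch of this normalization, the snippet below divides each grid's activity count by its area. The dictionary layout and the name `normalize_by_area` are assumptions for illustration; the two example areas correspond to the finest ($500\times 500$ m) and coarsest ($16\times 16$ km) resolutions mentioned earlier, and the counts are made up.

```python
def normalize_by_area(activity_counts, grid_area_m2):
    """Convert raw activity counts to activity per square meter.

    Both arguments are dicts keyed by grid id (hypothetical layout).
    """
    normalized = {}
    for grid_id, count in activity_counts.items():
        area = grid_area_m2[grid_id]
        if area <= 0:
            raise ValueError(f"grid {grid_id!r} has non-positive area")
        normalized[grid_id] = count / area
    return normalized


# Illustrative counts only; areas follow the stated grid resolutions.
counts = {"fine": 500.0, "coarse": 2560.0}
areas = {"fine": 500 * 500, "coarse": 16_000 * 16_000}
density = normalize_by_area(counts, areas)
```

Without this step, a coarse grid would dominate a fine one simply because it covers more ground; after normalization, the fine grid in this example has the higher activity density.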

Outlier removal was part of the cleaning procedure and was performed on the raw mobile activity data. It was based on two statistical concepts: (i) the median absolute deviation (MAD)²⁷, and (ii) a kurtosis score computed from the historical data. A data point was marked as an outlier when its MAD distance and kurtosis score were greater than the respective thresholds.

Due to the existence of trend changes (or concept drifts) in mobile activity data, we used the double median absolute deviation (left and right MADs). The left-MAD was used to calculate the distance from the median for all points less than or equal to the median, while the right-MAD was used to calculate the distance for points that were greater than the median. Thus, the MAD threshold was calculated as:

$$\begin{aligned} \Delta = \frac{\Lambda ({X})}{\Gamma (X)}, \end{aligned}$$

(1)

where $\Lambda ({X})$ is the absolute deviation of the time series X and $\Gamma (X)$ is the double MAD of X.
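
The double-MAD distance can be sketched in a few lines of standard-library Python. This is an illustrative reading of the description above, not the study's code: the function name, the handling of a zero MAD, and the threshold value of 3.0 are assumptions, since the MAD threshold used in the study is not stated.

```python
from statistics import median


def double_mad_distances(series):
    """Return |x - median| divided by the left or right MAD for each x."""
    m = median(series)
    left_dev = [abs(v - m) for v in series if v <= m]
    right_dev = [abs(v - m) for v in series if v > m]
    left_mad = median(left_dev)  # never empty: the minimum is <= median
    right_mad = median(right_dev) if right_dev else 0.0

    distances = []
    for v in series:
        mad = left_mad if v <= m else right_mad
        dev = abs(v - m)
        if mad == 0:
            # Assumed convention: zero spread makes any deviation infinite.
            distances.append(0.0 if dev == 0 else float("inf"))
        else:
            distances.append(dev / mad)
    return distances


MAD_THRESHOLD = 3.0  # hypothetical value, not taken from the study

series = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
distances = double_mad_distances(series)
flagged = [v for v, d in zip(series, distances) if d > MAD_THRESHOLD]
```

Here the median is 5.5 and both MADs equal 2.5, so the value 50 has a distance of 44.5 / 2.5 = 17.8 while every other point stays at or below 1.8. Using separate left and right MADs keeps a skewed upper tail from inflating the scale applied to points below the median.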

For a given data point, a kurtosis score was calculated using the historical data including the current data point. If the kurtosis score was larger than the threshold, the data point was flagged as an outlier. The threshold for the kurtosis was set experimentally to 3, which corresponds to the kurtosis value for a univariate normal distribution.
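
A minimal sketch of the kurtosis check follows, assuming the non-excess (Pearson) definition, for which a normal distribution scores 3 as stated above. The function names, the treatment of a constant window, and the example values are assumptions for illustration; the length of the historical window used in the study is not specified.

```python
def kurtosis(values):
    """Pearson (non-excess) kurtosis: fourth central moment / variance^2."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        # Undefined for a constant window; treated as 0.0 here (assumption)
        # so that such a window is never flagged.
        return 0.0
    return m4 / m2 ** 2


KURTOSIS_THRESHOLD = 3.0  # value stated in the text


def exceeds_kurtosis(history, current, threshold=KURTOSIS_THRESHOLD):
    """Score the history together with the current point, as described."""
    return kurtosis(list(history) + [current]) > threshold


history = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11]  # made-up activity values
spike_flagged = exceeds_kurtosis(history, 40)
normal_flagged = exceeds_kurtosis(history, 10)
```

Appending the spike of 40 raises the kurtosis of the window to roughly 8.9, well above the threshold, whereas appending an ordinary value of 10 leaves it near 2.5.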

No handling of missing data was required during the preprocessing step of the mobile activity data because our grid selection model could handle missing data.