Main

The COVID-19 pandemic has precipitated unprecedented global public health policy interventions and population-level behavioural change. Understanding localized outbreaks of SARS-CoV-2 can be exceedingly difficult due to the inevitable ascertainment bias that occurs through a test-by-request strategy1. Furthermore, case, hospitalization and mortality data are lagged indicators due to the delay between infection and the report and clinical outcome date2,3. Interpretation is further hampered by the tendency for outbreaks to initially cluster in younger-aged demographics in which, owing to decreased infection severity and a higher proportion of asymptomatic infections, representation in clinical data can be more limited4. Owing to the strength of assortative mixing, it typically takes several weeks or months for infections to spread into older-aged demographics, the point at which healthcare indicators can detect a substantial outbreak. Effective syndromic surveillance therefore requires data auxiliary to clinical outcomes for the early detection and identification of outbreaks, so that adequate public health interventions may limit transmission.

For a respiratory disease such as COVID-19, understanding the mobility patterns of individuals is central to calculating the transmission rate (β) and the force of infection (λ) by more accurately tracking the effective contacts between individuals. During the COVID-19 pandemic, mobility data from telecoms providers5, web-based companies6,7 and public transport organizations8 have been used to help understand contact patterns and adherence to non-pharmaceutical interventions (NPIs). Jeffrey et al.9 used mobility data from O2 and the Facebook application as a proxy for actual mobility patterns in the United Kingdom to monitor adherence to the March 2020 national lockdown. This type of data has been further employed to model dynamic mobility networks that enable the simulation of the SARS-CoV-2 epidemic10. There are, however, issues with relying on mobility data as a proxy for effective contacts11, because increased mobility in itself may not correlate with increased transmission. This is most palpable around the exponential phase of a novel SARS-CoV-2 variant, and any analyses employing these data must contextualize changes as only one component of the behaviours that can be associated with increases or reductions in transmission12. Mobility data appear to have had the greatest utility for nations or localities where a policy of NPIs was in place and the associated behavioural change can be more effectively interpreted9; such data continue to be used for transmission modelling by the Scientific Advisory Group for Emergencies (SAGE)13,14 for the UK Government.

The use of digital searching and reporting may offer insights for syndromic surveillance that cannot be obtained from conventional epidemiological indicators. Platforms such as Google Trends offer an interface to analyse relative search volumes (RSVs) for a given locality, providing real-time monitoring of morbidity within populations. Internet reporting behaviour has been explored for emerging pathogens such as Ebola15, and found to be of analogous utility to traditional surveillance for monitoring clusters and outbreaks. Google Trends has demonstrated potential for monitoring the respiratory virus H1N1 and the vector-borne West Nile virus16. Furthermore, Google Trends data have been used to assess adherence to NPIs during the COVID-19 pandemic in Taiwan by looking at, for instance, the demand for face masks17. Past research18,19,20,21 has discussed Google Trends as a data source for identifying early increases in national incidence of COVID-19 in much the same way as Google Flu Trends. The Google Flu Trends22 surveillance experience23 illustrated the need for secondary data sources, intelligently designed algorithms and constant recalibration as an epidemic evolves to account for temporal changes in searching behaviour. The fusion of Google Trends with multiple data sources (Twitter, hospitalizations and Google Flu Trends) as a surveillance system for influenza24 showed improved model performance over Google Trends alone and illustrated the promise of this approach for detecting changes in the transmission of SARS-CoV-2.

The interpretation of testing data at local scales is obfuscated25 by ascertainment bias, where key groups of interest can be excluded through the geographic impracticality of testing centres, the financial implications of being compelled to isolate, asymptomatic infection, and test availability at the time of infection or symptom onset. International research has highlighted increased positivity and the lowest testing rates for COVID-19 in the most deprived areas26,27,28. This has been shown to be exacerbated in deprived rural areas with less connectivity to local testing centres, and in minority groups that can be excluded from effective public health messaging. Further work has emphasized the lack of synchronicity that can be observed between testing data and clinical outcomes, which is compounded by the stochasticity of smaller numbers at finer spatial scales29. This can be a consequence of spatially clustered testing, which is particularly affected by targeted testing strategies30 and spatially heterogeneous institutional testing. However, the age–severity infection gradient could also lead to an expected divergence between the levels of test positivity and the number of observed clinical outcomes due to the temporally varying case composition. This emphasizes the importance of using epidemiological outcomes that are relevant to the public health policy context and the requirement for early intervention on the basis of the characteristics of novel variants in circulation.

Impactful syndromic surveillance of SARS-CoV-2 requires symptomatic prevalence, healthcare-seeking behaviour, mobility patterns and testing demand to be monitored to map and understand a widespread communicable disease. The interaction of healthcare-seeking behaviour with internet engagement31 has been shown to be important for understanding disease transmission. However, the application of this approach (specific to COVID-19 symptomatology) to small spatial scales has yet to be explored for its potential to function as an early indicator of an outbreak. We evaluated the suitability of Google Trends, Google mobility, telecoms mobility, National Health Service (NHS) Pathways 119 calls and website testing demand as predictive features for modelling outbreaks. We then employed multiple machine learning models, with sensitivity analysis of temporally lagged features, to assess their predictive performance at capturing localized outbreaks of COVID-19. We have designed two neural network architectures for spatial data, which we have termed the spatio-integrated long short-term memory (SI-LSTM) and spatio-integrated convolutional long short-term memory (SI-CNN-LSTM) algorithms.

Leading indicator analysis

We conducted an appraisal of the leading indicators for population-normalized polymerase chain reaction (PCR)-positive tests, hospitalizations and deaths from COVID-19 at local authority district (LAD) level in the UK. Our preliminary investigation assessed the feasibility of primary healthcare, social care and secondary healthcare data sources. This included general practitioner calls, clinical staff absences, care home incidence reports and school absences. We further explored 1,108 COVID-19-related Google Trends web search terms, the NHS COVID-19 mobile application32, Google mobility, telecoms mobility, COVID-19 test request website journeys and NHS Pathways 119 calls. Leading indicators were assessed using generalized additive models with a negative binomial error structure and dynamic time warping. Data sources were excluded if they did not have full geographic coverage, could not be sourced in a timely fashion (that would be relevant for an operational public health response), or were found to lag or be concurrent with the clinical target variables. Results indicated that the most consistent leading indicators across the epidemic phases were 94 Google Trends terms, telecoms mobility, Google mobility, website test request journeys and NHS Pathways 119 calls.

The Google Trends data were collected hourly for all four nations of the United Kingdom, resulting in data for 4,013 locations; they were scaled to LAD geography using the latitude and longitude coordinates provided by Google to map to the Office for National Statistics (ONS) boundaries33. Mobility data were collected from Google6 and telecoms operators5, which are reported at LAD and middle layer super output area (MSOA)33 geography, respectively. The data were prepared by mapping to the LAD level using the ONS lookups34 and extracting, among other things, demographic and person category (resident, worker, visitor) information. Website journey test request data were sourced from the Test and Trace Adobe Analytics platform, which measures both symptomatic and asymptomatic journeys through the test booking system. The data are further broken down by whether the journey was complete or incomplete at the final stage. Testing availability was defined as the number of individuals who completed the online journey (that is, were offered a test at the final stage and booked it) relative to all individuals who reached that final stage, including those who could not proceed. Adobe geolocates requestors on the basis of their internet protocol address, and a lookup table was created to aggregate the Adobe locations to LAD level. The 119 number was established as the contact number for the NHS Test and Trace service in May 202035, and provides a way to book a COVID-19 test and enquire about a test result; its scope has since expanded to process vaccination appointments. As with the other sources mentioned above, the dataset was aggregated to LAD geography using an ONS lookup table34. Only two types of call were selected: calls in which ‘Test enquiry—request a test’ was given as the call reason, and all calls, regardless of reason.

Modelling outbreak risk

An outbreak risk scoring system was developed for population-normalized COVID-19-positive PCR cases, hospitalizations and mortalities. The PCR-positive case data were sourced through the anonymized combined list collected by the UK Health Security Agency (UKHSA), which is derived from the National Pathology Exchange dataset36. The hospitalization data were obtained from the admitted patient care (APC) dataset37, which includes individuals that tested positive for COVID-19 fifteen days prior to and eight days post admission, and were aggregated from the lower super output areas to the LAD level. Mortality data were obtained from the UKHSA COVID-19 death linelist for England, and the public dashboards for Scotland38 and Northern Ireland39 (note that we did not have access to mortality data at LAD geography for Wales). The PCR testing and mortality data included for analysis had been evaluated for backfilling (that is, how long it takes before the last complete day of data) over the most recent seven day period prior to inclusion as a target. The hospitalization APC data have defined monthly periods during which hospital trusts must declare their admission activity data, and the last complete day was included. The daily PCR tests, hospitalizations and mortality data for each LAD were normalized per million population and smoothed over a rolling seven day window. The defined thresholds represent equal proportions of these distributions at LAD level for a defined temporal window of the epidemic in the UK. The risk score criteria are dynamic and are determined by changes to the daily proportions of cases, hospitalizations and deaths, which are influenced by variant severity, the availability of testing within a country, the ascertainment rate and the rate of disease prevalence, so that the scores remain informative indicators of inter-location heterogeneity.

Preliminary univariate analysis was conducted for the COVID-19 risk score targets using an autoregressive integrated moving average (ARIMA) model fit with a modified Hyndman–Khandakar algorithm at epidemic phase change points. Shallow learning algorithms (Random Forest40, XGBoost41, GBM42 and Naïve Bayes42) were trained on the leading indicator features, which were lagged from 15 to 40 days relative to the risk score target. We did not forecast beyond these periods as preliminary analysis indicated that model performance quickly deteriorated after 40 days. Random holdouts of up to 40 days were excluded across the epidemic phases to assess the performance of the models. K-fold cross-validation was included for each model (k = 10) in addition to a primary model that was trained on the entire training dataset. Eleven models were thus trained on the data: ten on each cross-validation split and the primary model on all of the training data. The trained models were then stacked to create an ensemble model using the XGBoost algorithm43. The stacking comprises training a second-level learner, called a meta-learner, to optimize the combination of the base learners.

We developed deep learning algorithms to enhance learning from the geospatial data, which we have termed the SI-LSTM (Fig. 1) and SI-CNN-LSTM (Fig. 2) algorithms (see Methods for further details). The SI-CNN-LSTM architecture takes advantage of the feature amplification ability of convolutional neural network layers, which use a type of weight sharing with local perception to refine and condense the number of parameters, helping to improve the learning efficiency of the LSTM layers44. These models were developed using bespoke generator functions45 for the LAD time-series in the UK, which yielded lagged batches of the features for the target variables. The model features were pre-processed using a log transformation to stabilize the variance and subsequently normalized so that the mean was zero and the standard deviation was one. Because the mobility data contain negative values, we applied an offset value before the log transformation to ensure that the step produced a real value. This is done to speed convergence towards the global minimum of the error surface and to mitigate the chance of becoming stuck at local optima. The model targets were one-hot encoded to convert the categorical input data into the vector required by the categorical cross-entropy loss function46.

Fig. 1: The architecture of the SI-LSTM algorithm.
figure 1

The features from each location are fed in as inputs along network branches that contain LSTM layers, time-distributed dropout layers and a dense layer, producing a side output. The tensors are further concatenated to produce the main model output for each area.

Fig. 2: The architecture of the SI-CNN-LSTM algorithm.
figure 2

The features from each location are fed in as inputs along network branches that contain time-distributed one-dimensional convolution layers, a time-distributed max pooling layer, a time-distributed flatten layer, LSTM layers, and dense and dropout layers, producing a side output. The tensors are further concatenated to produce the main model output for each area.

The final model architecture included a seven day lookback to capture the weekly trend in the features, shuffling of the order of the training data, and a decrease in the learning rate for subsequent epochs if an increase in the validation loss was detected. In the final layers of the SI-LSTM and SI-CNN-LSTM, we introduced a connection network between all of the geographic locations so that model performance can be optimized through intra- and inter-location feature weighting. The 363 independent input branches are merged by combining the list of tensors from the final LSTM layer for each location on a single concatenation axis, which produces a single tensor as described in Fig. 3. The final LSTM layer produces a rank-2 tensor of shape (b, u), where b is the batch size and u is the number of units in the LSTM layer. After concatenation of the tensors from L locations, the resulting tensor has shape (b, Lu). The final dense layer has a softmax activation function to ensure that the output vector y over the C classes is normalized, so that yi can be interpreted as the probability that the target is class i, for i ∈ {1, …, C}. The cross-entropy loss function is then defined as:

Fig. 3: The geospatial connection network of the spatially integrated layers across LADs in the UK.
figure 3

A merging of the 363 independent input branches by combining the list of tensors, from the final LSTM layer for each location, on a single concatenation axis.

$${{{\mathcal{L}}}}\left( {y_i,t_i} \right) = - \mathop {\sum}\limits_{i = 1}^C {t_i\log y_i}$$

where ti is the one-hot encoded target vector. We then used RMSprop as the optimization function in the back-propagation stage.

Results

Univariate forecasting

Univariate ARIMA modelling, using a modified Hyndman–Khandakar algorithm47 for stepwise performance tuning, was conducted using PCR-positive cases, hospitalizations and mortalities from COVID-19 (Extended Data Fig. 1). We observe that the ARIMA models struggle, particularly at change points in the epidemic wave, to reliably predict the growth trajectory. This is particularly pronounced in the pre-exponential phase, the exponential phase and at the turning point of an epidemic peak, as evidenced by the LAD model results in Extended Data Table 1, in which the models struggled to reach an accuracy of 50% across the Alpha wave.

Feature importance

To assess feature importance, we used a Random Forest algorithm, with random temporal holdouts, across the different feature groups included (Extended Data Table 2). The most important feature group for all tasks was Google Trends, followed by Telecoms Mobility, Google Mobility, Website Testing Demand and 119 Calls. A full statistical description of the included model features can be found in Supplementary Table 1. This performance is indicative of the periods assessed in this paper; feature importance has evolved across the COVID-19 epidemic in the UK and has been influenced by extrinsic pressures such as NPIs, changes in testing behaviour and policy, and novel variant patterns of growth (the feature importance for each epidemic phase and variant can be seen in Supplementary Figs. 1–3).

Google Trends at LAD

The search terms that received the highest relative search volume scores across the research period can be seen in Fig. 4. Variations on requests for COVID-19 tests have the highest overall volume observed at LAD level in the United Kingdom. The highest-volume entity terms observed for COVID-19 are ‘sore throat’, ‘cough’, ‘fatigue’, ‘fever’ and ‘shortness of breath’. Although absolute volume is of interest to maintain relevant and timely search terms, it is not necessarily reflective of the terms' overall feature importance in the model spatially and temporally. It is how the terms interact with each other and with the auxiliary data that determines their importance for outbreak detection.

Fig. 4: The search terms that received the highest relative search volumes.
figure 4

A bar chart of the top-30 search terms from the Google Trends data collection in the United Kingdom at LAD, measured by RSV.

Spatio-temporal modelling

The SI-CNN-LSTM and SI-LSTM algorithms performed better across all temporal periods and for every target relative to the shallow learning algorithms assessed (Fig. 5). The greatest performance differential was observed for the mortality risk scores, where the SI-LSTM saw an improvement of up to 15% relative to the best shallow learning algorithm. There was a clear performance improvement from the geospatial concatenation, which can be observed in the main-output accuracy relative to the side-output accuracy in Extended Data Fig. 2. The peak temporal performance across the modelled targets was observed in the SI-LSTM: 99.4% accuracy for the case risk score (feature lag of 25 days), 96.3% for the hospitalization risk score (feature lag of 40 days) and 84.8% for the mortality risk score (feature lag of 25 days).

Fig. 5: The performance of the SI-CNN-LSTM and SI-LSTM algorithms across all temporal periods.
figure 5

A line graph of the model accuracy for confirmed SARS-CoV-2 case, hospitalization and mortality risk scores for the shallow- and deep-learning algorithms across the temporal delay periods.

In the deep learning models, convergence of the training and validation loss was observed after around 20 epochs for case risk scores, 15 epochs for hospitalization risk scores and 30 epochs for mortality risk scores. The SI-LSTM overall performs slightly better on the test data than the SI-CNN-LSTM, and convergence is reached after fewer epochs (Extended Data Fig. 3). Earlier model architectures encountered volatility in the validation loss, which was resolved by decreasing the learning rate and increasing the batch size. Moreover, by providing a dynamic learning rate during training, we found that a smaller value of around 0.001 generally produced optimal model convergence. Sensitivity analysis found that the RMSprop optimizer performed better than stochastic gradient descent and Adamax, as seen in Extended Data Fig. 4. We also found that a larger tensor shape for each LSTM layer, corresponding to 128–160 units, produced higher validation accuracies.

Analysis of the shallow learning models found the highest overall performance of 95.3% on the case risk score for XGBoost, followed by the Ensemble, Random Forest and GBM, which had only slightly reduced accuracy on the test data. The Naïve Bayes model, by contrast, did not perform well for longer projection periods or for mortality risk. The log loss across each temporal period and shallow learning algorithm can be seen in Extended Data Fig. 5; by this metric, the distinction in performance between XGBoost, Random Forest and the Ensemble relative to the GBM is slightly more pronounced. The shallow and deep learning models performed better on the case risk score and hospitalization risk score targets. However, all models saw diminished performance on the mortality risk score, which may be related to the increased relevance of the features for capturing transmission in younger demographic groups.

Sensitivity analysis was conducted on the hyperparameters of the XGBoost, Random Forest and GBM algorithms. The number of trees, tree depth and learning rate were varied to establish whether performance could be further optimized. This analysis established that, for the GBM and XGBoost (Extended Data Fig. 6), provided that the number of trees exceeded 1,000, the tree depth was at least 10 and the learning rate was at least 0.01, performance was relatively insensitive to the hyperparameters. Moreover, the Random Forest model performed optimally with a maximum tree depth of 5 and when the number of trees exceeded 500.

Public health operational model outputs

For the purpose of epidemic management, the early spatial identification of the pre-exponential and exponential change points—prior to their identification through traditional epidemiological surveillance—is important for an effective response to outbreaks of novel variants of concern.

In December 202048, the outbreak of the Alpha variant in England began by clustering around the county of Kent. During this period England was also experiencing substantial growth in COVID-19 (the D614G mutant of SARS-CoV-2), which had precipitated a lockdown in November 202049. On the 1st of November 2020, the modelling identified the exponential growth of the Alpha variant in Kent (Fig. 6) prior to its identification through sequenced PCR tests. The lack of testing availability identified through website test requests during the Alpha wave is noteworthy and may have masked the identification of increased case rates in some local authorities.

Fig. 6: Maps of England showing confirmed case risk predictions, sequenced cases and test availability data for the Alpha and Omicron variants.
figure 6

In the top panel: sequenced Alpha cases population-normalized per 100,000 averaged from the 20th of November to the 10th of December 2020; confirmed case risk predictions from features on the 1st of November 2020 trained to forecast for up to 30 days; testing availability over the training period. In the bottom panel: sequenced Omicron cases population-normalized per 100,000 averaged from the 5th to the 10th of December 2021; confirmed case risk predictions from features on the 20th of November 2021 trained to forecast for up to 20 days; testing availability over the training period.

The outbreak of the Omicron BA.1 variant was initially identified in late November 2021 in England50. The clustering of this variant around London and the South East region was detected through the modelling of leading indicator features from the 20th of November 2021 (Fig. 6). This was identified against the background of a high prevalence of the Delta variant, and at this time there had been only eight confirmed sequenced PCR cases of Omicron BA.1 in England. The unprecedented wave of incidence observed in December 2021 necessitated a higher tiering in the case data, which can be seen in Supplementary Fig. 4.

Discussion

The heterogeneous nature of the COVID-19 epidemic, characterized by localized outbreaks, presents challenges for public health policy in that certain areas may warrant more substantial interventions to contain the spread of SARS-CoV-2. The aim of this modelling approach is to provide policy-makers with an early indicator syndromic surveillance framework for local areas which, when combined with other lines of reporting, can aid in pandemic management. This localized focus has become increasingly important as importations of SARS-CoV-2 variants of concern become the focus of outbreak response51,52. We have illustrated, akin to the literature on other communicable diseases31, that RSV data can be of utility in understanding transmission hotspots when the terms are carefully selected and further clinical and non-clinical data are included in model development.

The SI-LSTM geospatial architecture design allowed for specific intra-location learning while also benefitting from inter-location information sharing. This model architecture achieved the highest overall performance of above 99% accuracy on the unseen data for the case risk score at the local authority level in the UK. We found that a smaller learning rate and larger batch size were important in reducing validation loss volatility, despite research suggesting that LSTMs work well with larger learning rates53 because they push the output gate to zero. The inclusion of convolutional neural network (CNN) layers and regularization in the dense layers produced comparable performance for each temporal delay period assessed in this paper. We discovered in early model development that the performance of the SI-CNN-LSTM and SI-LSTM models improved more, relative to the shallow learning algorithms, with a longer time-series of training data; when dealing with a shorter time-series there may therefore be a preference towards a shallow learning algorithm approach.

The willingness or ability to opt into the testing system54 substantially impacts insights from conventional epidemiological data for epidemic surveillance. The motivation to seek or report a test has been found to be related to symptom severity and a lack of understanding of the main symptoms of COVID-19, which has been observed to a greater extent in older age groups55. This will be further impacted by socio-economic vulnerabilities, the ability to acquire a test and the feasibility of testing locations. Due to the vulnerability of the confirmed case risk score model target to this ascertainment bias, we normalized positive test counts and defined epidemiologically important ranges that are more robust to these fluctuations. We further adjusted the model target ranges to reflect the spatio-temporal variation in testing availability and observed that the inclusion of testing availability as a model feature improved performance for some local authorities. In locations that have limited testing coverage, which is particularly relevant as public health policy evolves in response to the pandemic, the modelling framework proposed may be better suited to the other clinical targets of COVID-19 infection included in this study.

The study found mobile device and telecoms mobility data to be robust predictive features of increased SARS-CoV-2 transmission. The novel application of these data to disease modelling in the COVID-19 pandemic has allowed for a greater understanding of movement patterns that can help to identify locations of concern, importations between local authorities and behavioural responses to the easing of NPIs9. However, the importance of the mobility data as a leading indicator evolves with the temporal epidemic phases and extrinsic factors. In later periods when NPIs were more limited, the mobility data, in isolation, were a better predictor of transmission when the virus showed patterns of endemicity. Models that have been developed56 to focus primarily on mobility proxies may therefore be limited in their ability to accurately capture novel variant growth. This can be explained by fluctuations in transmission being determined by mobility patterns when a variant is more established and growth is more stable; these data independently will have less utility at recognizing the pre-exponential and exponential phase growth from the introduction of a new variant, particularly if pre-pandemic contact patterns have returned. However, this study finds that, in combination with proxies of symptomatic prevalence, mobility data can be an effective leading indicator across the epidemic phases.

For the use of Google RSV data in an operationally relevant environment, it is essential to monitor the relative frequency of the terms (see Extended Data Fig. 7) as behaviours57,58 and worldwide government directives evolve over the course of the pandemic. This is to preclude monitoring terms that are no longer relevant as healthcare-seeking habits change59 or that are likely to be unduly driven by extrinsic pressures such as media reporting60, and to capture novel behaviours that may be important. Novel variants have presented diverse symptomatology profiles61 and it is therefore important to keep a broad spectrum of symptoms included in the data collection. Further research on Google RSV data at the local authority level should investigate locations for post-acute COVID-19 (long COVID62) in areas disproportionately impacted and that have had stubbornly high transmission of COVID-19. Moreover, there may be further insight gleaned from the use of these data to assess the impact on mental health between locations that have been under longer-term local lockdowns63.

Digitalized web-based data sources (Google Trends, and Test and Trace website test requests) included in the analysis have a bias towards younger-aged demographics. However, these groups were the focus of the analysis, because an epidemic wave of a respiratory infection such as COVID-19 is predominantly driven by younger age groups (<65 years old), which have higher effective contact rates64,65. Moreover, further research has identified that resurgent epidemic waves of the SARS-CoV-2 virus have been driven largely by working-age adults66 and that the 18–39 age group led the replacement of Delta by Omicron BA.1 (ref. 67) in the UK. A preliminary assessment of leading indicators from primary health and social care data sources that exclusively target the oldest age groups found that they had limited geographic coverage in the UK, were difficult to source in an operationally useful manner and lagged community transmission. The 119 telephonic requests for PCR and lateral flow tests included in the modelling were found to have a slightly older age composition relative to online test requests, which may have aided in the identification of increased transmission for these ages.

The difficulty of identifying change points in an epidemic curve has been a consistent modelling challenge across the pandemic68. This has been frequently observed for widely developed transmission models69,70,71,72 that rely on historical data to fit the model and on transmission simulations with prescribed parameters (which are difficult to quantify70) to develop projections. The parameter space for transmission models evolves for each new variant, with the collection of data required to update these parameters too lagged for early epidemic management. For instance, the estimates required for the generation time, serial interval, incubation period and the time to a clinical event2,73 have usually required, at minimum, a month or longer for an adequate sample to be collected from contact tracing. Different choices in these parameter spaces lead to great divergence in the modelling projections fit to the same data. Machine learning approaches74,75,76 and statistical forecasting models77 that are univariately trained on confirmed cases are limited in an operationally responsive space to provide a meaningful window for interventions, as they will struggle to identify a signal until incidence is in clear exponential growth or decay. This will be further compounded by confirmed tests being a lagged indicator of increased incidence, which is exacerbated at times of heightened ascertainment bias. Moreover, at a small spatial scale, models trained solely on case data will suffer from a great number of false signals, especially if confirmed cases are not adjusted for some measure of testing availability or the rate of ascertainment.

We propose a novel modelling approach that has been developed for public health response organizations and that has wider relevance for modelling outbreaks of COVID-19 outside of the United Kingdom. This study is designed to provide a modelling framework and data sources that can be effectively employed to create early warning indicators of changes in transmission and to project the hospital and mortality burden at small spatial scales. The defined modelling approach is designed to be adaptable to different stages of the pandemic, and the risk scoring system should be tailored to the current rate of prevalence and the severity profile of a variant for a specific population. This approach focuses on trends, and changes in those trends, that provide spatial insights on a novel outbreak and the epidemic trajectory.

Conclusion

Timely and well-informed syndromic surveillance is essential to inform effective public health policy throughout the SARS-CoV-2 pandemic. The monitoring of traditional clinical indicators can be lagged and misleading, which hampers efforts to identify hotspot localities. We have coalesced the most meaningful leading indicator data currently available in the UK to identify local authorities of concern. The models described are used as part of the UK’s coordinated response to the COVID-19 pandemic, alongside a suite of other data sources, to inform public health policy and identify areas with concerning levels of transmission.

This study found that the SI-LSTM algorithm design was able, for the assessed temporal periods, to predict hotspot locations over time horizons of a month or more with a high degree of accuracy. The novel architecture described in this paper provides a framework for modelling temporally variable geospatial data. We anticipate that this model architecture has uses beyond the epidemiological application described in this paper.

In public health operational use, the models accurately forecast the exponential increase in the Alpha variant in December 2020, the Delta variant in April 2021 and the Omicron variant in November 2021 within local authorities in the UK. The evolution of the pandemic may render certain data sources less important for modelling purposes and, due to extrinsic pressures, modelling RSV trends must be conducted with careful design, relevant auxiliary features and meaningful clinical targets.

Methods

This section first outlines the steps taken to collect and prepare the data sources for modelling. The development of the models is described at the end of the section.

Data collection and preparation

Google Trends

Google Trends data provide RSVs by search term and location over time, which can be accessed via the public website trends.google.com. The data are normalized by total search volume78 and reflect the relative importance of terms over time and space. Both national and highly localized city-level data were analysed for this work. The city-level data can be found under the ‘Interest by city’ panel within the user interface. We collected hourly RSVs for all four nations of the United Kingdom, resulting in data for 4,013 locations.

We had the support of the Google Trends editorial team throughout the project, who facilitated data acquisition and provided a Google Health Trends API key. A preliminary analysis was conducted on the daily relative values provided by Google for each city location. The daily relative value was found to be of limited utility due to the high proportion of zero values reported. Further exploration discovered that data collected at an hourly frequency resolved this issue. We therefore executed hourly requests to collect the Google search trends data.
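The hourly collection described above used the Google Health Trends API, which requires an access key. As a publicly reproducible stand-in, the snippet below sketches how hourly RSVs and a city-level breakdown can be retrieved with the open-source pytrends client; the term list is an illustrative subset and the client's quotas and geographic resolutions differ from those of the Health Trends API used in this study.

```python
# A minimal sketch using the public pytrends client as a stand-in for the
# Google Health Trends API used in this study (terms are an illustrative subset).
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-GB", tz=0)
terms = ["covid test", "sore throat", "loss of smell"]  # hypothetical subset of the 94 terms

# The 'now 7-d' timeframe returns hourly relative search volumes for the past seven days.
pytrends.build_payload(terms, timeframe="now 7-d", geo="GB")
hourly_rsv = pytrends.interest_over_time()

# City-level breakdown, analogous to the 'Interest by city' panel in the user interface.
by_city = pytrends.interest_by_region(resolution="CITY", inc_geo_code=True)
print(hourly_rsv.head(), by_city.head())
```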

At the outset of the project, the collection included 108 terms to capture the most frequently observed symptoms of COVID-1979, NHS medical advice-seeking behaviour, COVID-19 testing, and common over-the-counter treatments for COVID-19. These terms were supplemented with a further 1,000 search items found to be the most commonly employed phrases in NHS Pathways 111 telephonic COVID-19 triages80. We excluded certain words and phrases for their lack of overall relevance in the context of a search term and their relative occurrence at a national level in the Google Trends user interface. Preliminary analysis conducted at a national level involved generalized additive models with a negative binomial error structure and dynamic time warping to assess the selected terms’ relevance as predictive features of COVID-19 incidence and clinical outcomes. The analysis highlighted 94 important terms that were relevant for further analysis, and seven primary symptoms of COVID-19 were included as Google entity terms.

The Google data were then processed to match, geographically and by date, the recorded SARS-CoV-2 case, hospitalization and mortality data at LAD level. Google estimates search locations using sources including GeoIP and, where available, the GPS coordinates of the device81. Lookups were therefore developed using the latitude and longitude provided by Google to map the data to the ONS33 designated LAD geographies. This was not possible for central London, and as a result a group of LADs was created to match Google’s London location.

Mobility data

Mobility data were collected from Google6 and telecoms operators5, which are reported at LAD and MSOA33 geography, respectively. The Google mobility data measure the change in visits to, and length of stay at, six different place categories compared with a baseline period between the 3rd of January and the 6th of February 20206. The categories are grocery and pharmacy, parks, transit stations, retail and recreation, residential, and workplaces. Locations are provided, based on the ISO 3166 standard, as ‘country_region_code’, ‘sub_region_1’ and ‘sub_region_2’. The telecoms mobility data contain counts of the number of people and their number of journeys over time at MSOA geography. The data were prepared by mapping to LAD using the ONS lookups34 and extracting, among other things, demographic and person category (resident, worker, visitor) information. The absolute numbers in this dataset are challenging to interpret but, as with the other sources presented in this paper, it is the trends rather than the absolute numbers that are important.

Website COVID-19 testing journey data

Website COVID-19 testing journey data were sourced from the Test and Trace Adobe Analytics platform, which measures both symptomatic and asymptomatic journeys through the test booking system. The data are further broken down by whether the journey was complete or incomplete. An incomplete booking journey is one in which a person does not proceed at the final stage of the online journey to book the test. Adobe geolocates requestors on the basis of their internet protocol and a lookup table was created to aggregate the Adobe locations to LAD level.

NHS Pathways 119 data

The 119 number was established as the contact number for the NHS Test and Trace service in May 202035 and provides a way to book a coronavirus test and enquire about a test result; its scope has since expanded to process vaccination appointments. The dataset includes the call date and reason along with the geographic location of the caller. As with the other sources mentioned above, the dataset was aggregated to LAD geography using an ONS lookup table34. Only two types of call were selected: calls in which ‘Test enquiry—request a test’ was given as the call reason and all calls, regardless of reason.

Testing availability

The greatest quantity of diagnostic tests conducted for COVID-19 are requested through the website. Testing availability was defined as the number of individuals who completed the online journey (that is, were offered a test at the final stage and booked it) relative to all individuals who reached that final stage, including those who could not proceed. An inability to proceed may be due to a lack of available RT-PCR tests, because the testing centre location was not accessible, or because the requestor chose not to proceed.

Due to the temporal and geographic disparity in testing availability throughout the pandemic, we calculated testing availability as a function of location l and time t. A completion denotes an individual who finished the website test request journey and for whom a test was conducted. Availability is defined by the following equation:

$$\begin{array}{l}{{{\mathrm{Availability}}}}\,(l,t) \\ = \frac{{{{{\mathrm{Symptomatic}}}}\,{{{\mathrm{completions}}}}\,(l,t) + {{{\mathrm{Asymptomatic}}}}\,{{{\mathrm{completions}}}}\,(l,t)}}{{{{{\mathrm{Symptomatic}}}}\,{{{\mathrm{final}}}}\,{{{\mathrm{stage}}}}\,{{{\mathrm{users}}}}\left( {l,t} \right) + {{{\mathrm{Asymptomatic}}}}\,{{{\mathrm{final}}}}\,{{{\mathrm{stage}}}}\,{{{\mathrm{users}}}}\,(l,t)}}\end{array}$$

Availability(l, t) = 1 corresponds to an area where all of those who request a test receive one, and Availability(l, t) = 0 corresponds to an area where testing is entirely unavailable on request.

Testing availability was employed as a feature when modelling the case rates for a locality. Testing data coverage is heterogeneous and the ascertainment bias is time varying; therefore, for the operational presentation of modelling results trained on case data, we included testing availability scores to understand gaps in local coverage that the model may not identify.
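A minimal sketch of the availability calculation defined above, assuming a hypothetical pandas DataFrame with one row per LAD and day containing the symptomatic and asymptomatic completion and final-stage user counts (the column names are illustrative):

```python
import pandas as pd

def testing_availability(journeys: pd.DataFrame) -> pd.Series:
    """Availability(l, t) = completions / final-stage users, per LAD and day."""
    completions = journeys["symptomatic_completions"] + journeys["asymptomatic_completions"]
    final_stage = journeys["symptomatic_final_stage_users"] + journeys["asymptomatic_final_stage_users"]
    # 1 -> everyone who requested a test received one; 0 -> testing unavailable on request.
    return (completions / final_stage).clip(0, 1)

journeys = pd.DataFrame({
    "lad_code": ["E06000001", "E06000001"],
    "date": pd.to_datetime(["2021-11-20", "2021-11-21"]),
    "symptomatic_completions": [120, 95],
    "asymptomatic_completions": [60, 40],
    "symptomatic_final_stage_users": [150, 140],
    "asymptomatic_final_stage_users": [70, 60],
})
journeys["availability"] = testing_availability(journeys)
```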

Outbreak risk score

The primary purpose of this modelling approach is to highlight areas of concern before a substantial outbreak occurs within a LAD. An outbreak risk score was therefore developed for confirmed SARS-CoV-2 PCR-positive cases, hospitalizations and mortalities (Supplementary Fig. 5). The PCR-positive case data were sourced through the anonymized combined list collected by the UKHSA, which is derived from the National Pathology Exchange dataset36. The hospitalization data were obtained from the APC dataset37, including individuals that tested positive for COVID-19 fifteen days prior to and eight days post admission, and were aggregated from the lower super output areas to the LAD level. Mortality data were obtained from the UKHSA COVID-19 death linelist for England, and the public dashboards for Scotland38 and Northern Ireland39 (we did not have access to mortality data at LAD geography for Wales).

The PCR testing and mortality data included for analysis had been evaluated for ‘backfilling’ (how long it takes before the last complete day of data) over the most recent seven day period prior to inclusion as a target. The hospitalization APC data have defined monthly periods during which hospital trusts must declare their admission activity data, and the last complete day was included. The daily PCR tests, hospitalizations and mortality data for each LAD were normalized per million population and smoothed over a rolling seven-day window.

The thresholds for the risk scores were determined by analysis of the population-normalized daily distributions of cases, hospitalizations and mortalities at LAD level. The defined thresholds represent equal proportions of these distributions at LAD for a defined temporal window of the epidemic in the UK. In a public health operational response setting, these thresholds were initially informed by the localized interventions in the United Kingdom through the tiering system47. The risk score criteria are dynamic and are determined by changes to the daily proportions of cases, hospitalizations and deaths, which are influenced by variant severity, the availability of testing within a country, the ascertainment rate and the rate of disease prevalence, so that the scores remain informative indicators of inter-location heterogeneity.
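A minimal sketch of the target construction described above: population normalization per million, seven day rolling smoothing, and quantile thresholds that split the LAD-level distribution into equal proportions for a chosen temporal window. The number of tiers, column names and windowing are illustrative assumptions rather than the operational specification.

```python
import pandas as pd

def outbreak_risk_score(df: pd.DataFrame, value_col: str, n_tiers: int = 4) -> pd.DataFrame:
    """df: one row per LAD and day with columns ['lad_code', 'date', value_col, 'population']."""
    df = df.sort_values(["lad_code", "date"]).copy()
    # Population-normalize per million and smooth over a rolling seven day window.
    df["rate_per_million"] = df[value_col] / df["population"] * 1e6
    df["smoothed"] = (
        df.groupby("lad_code")["rate_per_million"]
          .transform(lambda s: s.rolling(7, min_periods=1).mean())
    )
    # Thresholds as equal proportions (quantiles) of the LAD distribution for the window.
    df["risk_score"] = pd.qcut(df["smoothed"], q=n_tiers, labels=False, duplicates="drop") + 1
    return df
```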

Model development

The data used for analysis in this work were collected from the 1st of October 2020, and model performance was measured up to July 2021. The software used for model development included Python v.3.10.0 and R v.4.2.0. The targets for the machine learning modelling were defined as the daily confirmed case risk score, hospitalization risk score and mortality risk score. The features used for the machine learning modelling included Google Trends search data, Google mobility, telecoms mobility, NHS Pathways 119 call categories, testing availability, location, and asymptomatic and symptomatic website testing request journeys. The features, analogous to the targets, were smoothed over a rolling seven-day window due to the erratic nature of this time-series data when analysed daily. For our modelling purposes, and its operational use case, we sought to identify trends, rather than the precise value on a given day, to highlight an area of concern.

Time-series analyses of the data were conducted using shallow learning and deep learning algorithms, and the features were lagged relative to the target from 15 to 40 days to assess their predictive temporal relationship with the clinical indicators. Forecasting was not attempted beyond these periods as preliminary analysis found that model performance quickly deteriorated after 40 days. This project ran a total of 2,057 models, including the sensitivity analysis of hyperparameters.

Univariate forecasting

To understand the difficulty of the predictive task and where the proposed models are likely to struggle, a univariate forecasting approach was developed for population-normalized cases, hospitalizations and mortalities at the LAD level. An ARIMA model was fit using a modified Hyndman–Khandakar algorithm82 for stepwise performance tuning using unit root tests and the Akaike information criterion. Model performance was further measured by the risk scoring criteria developed for cases, mortalities and hospitalizations.
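The Hyndman–Khandakar stepwise search is implemented in R's forecast::auto.arima; below is a minimal Python sketch using pmdarima's auto_arima, which follows the same stepwise procedure with unit root tests and AIC selection, applied to a single synthetic LAD series (the series and forecast horizon are illustrative only).

```python
import numpy as np
import pmdarima as pm

# Synthetic seven-day-smoothed case rate series for a single LAD (illustrative only).
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(0.5, 2.0, size=120)).clip(min=0)

# Stepwise Hyndman–Khandakar-style search: unit root tests choose d, AIC selects (p, q).
model = pm.auto_arima(y, seasonal=False, stepwise=True,
                      information_criterion="aic", suppress_warnings=True)
forecast = model.predict(n_periods=14)  # two-week-ahead univariate projection
print(model.order, forecast[:5])
```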

Shallow learning

Model design

With the features lagged from 15 to 40 days, we trained Random Forest40, XGBoost41, GBM42 and Naïve Bayes42 algorithms on the risk score target. Log loss was the defined loss metric for the Random Forest, XGBoost and GBM, with a stopping tolerance of 0.001 (full model hyperparameter specifications can be found in Supplementary Table 2). Random holdouts of up to 40 days of data were excluded from the training sample and used to assess model performance. K-fold cross-validation was also included for each model (k = 10) in addition to a primary model that was trained on the entire training dataset. Eleven models were therefore trained on the data: ten on each cross-validation split, and the primary model on all of the training data. The trained models were then stacked to create an ensemble model using the XGBoost algorithm43. The stacking comprises training a second-level learner, called a meta-learner, which combines the base learners to optimize performance.
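A minimal sketch of the cross-validated stacking described above, using scikit-learn and xgboost as stand-ins for the implementation used in the study; the base learner hyperparameters echo the sensitivity analysis reported in the Results, the synthetic data stand in for the lagged leading indicator features and risk score classes, and a GBM base learner is omitted for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=500, max_depth=5, random_state=0)),
    ("xgb", XGBClassifier(n_estimators=1000, max_depth=10, learning_rate=0.01,
                          eval_metric="mlogloss")),
    ("nb", GaussianNB()),
]

# Ten-fold cross-validated predictions from the base learners feed an XGBoost meta-learner.
ensemble = StackingClassifier(
    estimators=base_learners,
    final_estimator=XGBClassifier(eval_metric="mlogloss"),
    cv=10,
)

# Synthetic stand-in for the lagged leading indicator features and risk score classes.
X_lagged, y_risk = make_classification(n_samples=300, n_features=20, n_classes=3,
                                       n_informative=8, random_state=0)
ensemble.fit(X_lagged, y_risk)
```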

Feature importance and sensitivity analysis

Sensitivity analysis was conducted to find the optimal hyperparameter combinations for each shallow learning algorithm across the assessed temporal periods. This included the tree depth, the number of trees and the learning rate. To illustrate the relative importance of each data source at predicting the risk score targets, a Random Forest algorithm was trained on each source's features in turn and the performance was evaluated. We measured the performance at a 15 day lag in the features for the PCR-positive case target, a 20 day lag for hospitalizations and a 25 day lag for mortalities. These delays were selected as the optimal performance periods of the Random Forest algorithm. The results provided are the overall performance across the assessed periods; however, these relationships change across epidemic phases. Therefore, feature importance was assessed across every epidemic phase for each replacing variant of SARS-CoV-2 using an XGBoost algorithm.
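A minimal sketch of the feature group comparison: a Random Forest is trained on each source's lagged features in turn and holdout accuracy is compared. The group-to-column mapping, holdout construction and hyperparameters are illustrative assumptions; the study used random temporal holdouts rather than the simple random split used here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def feature_group_importance(X: np.ndarray, y: np.ndarray, groups: dict) -> dict:
    """groups: mapping from a data source name to the column indices of its lagged features."""
    scores = {}
    for name, cols in groups.items():
        X_train, X_test, y_train, y_test = train_test_split(
            X[:, list(cols)], y, test_size=0.2, random_state=0)  # stand-in for a temporal holdout
        rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
        scores[name] = rf.score(X_test, y_test)
    return scores

# Illustrative grouping of feature columns by data source.
groups = {"google_trends": range(0, 94), "telecoms_mobility": range(94, 100),
          "google_mobility": range(100, 106), "website_testing": range(106, 110),
          "calls_119": range(110, 112)}
```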

Deep learning

In the following section we discuss the data pre-processing for the deep learning algorithms, the preliminary sensitivity analysis, and the final model architectures.

Data pre-processing

The model features were pre-processed using a log transformation to stabilize the variance and subsequently normalized so that the mean was zero and the standard deviation was one. Because the mobility data contain negative values, we applied an offset value prior to log transformation to ensure that the step produced a real value. This is done to speed convergence towards the global minimum of the error surface and to mitigate the chance of becoming stuck at local optima. The model targets were one-hot encoded to convert the categorical input data into the vector required by the categorical cross-entropy loss function46.
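A minimal sketch of this pre-processing, assuming a feature matrix X (which may contain negative mobility values) and integer risk score targets y; the offset choice is an illustrative assumption.

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

def preprocess_features(X: np.ndarray) -> np.ndarray:
    # Offset so every value is positive before the variance-stabilizing log transform.
    offset = 1.0 - X.min() if X.min() <= 0 else 0.0  # illustrative offset choice
    X_log = np.log(X + offset)
    # Standardize each feature to zero mean and unit standard deviation.
    return (X_log - X_log.mean(axis=0)) / X_log.std(axis=0)

def preprocess_targets(y: np.ndarray, n_classes: int) -> np.ndarray:
    # One-hot encode for the categorical cross-entropy loss.
    return to_categorical(y, num_classes=n_classes)
```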

The model utilized a generator function45 for every LAD and yielded lagged batches of the features for the target variables. The arguments of the generator function included:

  • Lookback (how many time steps of features to include for each target)

  • Lag (how many time steps in the past are the features relative to the target)

  • Shuffle (whether to shuffle the order of the training data)

  • Batch size (how many samples are used per batch)

  • Minimum and maximum indices (the portion of the overall time-series to use for each location)
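A minimal sketch, under the assumptions above, of such a generator for a single LAD, yielding (samples, targets) batches with the stated arguments; the array names and shapes are illustrative (features: time × n_features, targets: time × n_classes).

```python
import numpy as np

def lad_generator(features, targets, lookback=7, lag=25, shuffle=True,
                  batch_size=32, min_index=0, max_index=None):
    """Yield batches of shape (batch, lookback, n_features) with targets lag days ahead."""
    if max_index is None:
        max_index = len(features) - lag - 1
    i = min_index + lookback
    while True:
        if shuffle:
            rows = np.random.randint(min_index + lookback, max_index + 1, size=batch_size)
        else:
            if i + batch_size > max_index + 1:
                i = min_index + lookback  # wrap around when the series is exhausted
            rows = np.arange(i, min(i + batch_size, max_index + 1))
            i += len(rows)
        samples = np.stack([features[r - lookback:r] for r in rows])
        labels = np.stack([targets[r + lag] for r in rows])
        yield samples, labels
```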

Preliminary analysis

Preliminary exploratory analysis was conducted on the defined lookback period, shuffling of the training order, the number of LSTM and CNN layers, L1 and L2 regularization on dense layers, the shape of the tensor for each layer, and the use of dropout layers. We also assessed the relative impact of different optimization functions: RMSprop83, stochastic gradient descent84 and Adamax85.

Model design

The final model design included a seven day lookback for the delay periods of 15, 20, 25, 30, 35 and 40 days. This means that, for a target on a given day, the algorithm would utilize the past seven days of features. The lookback was included to capture the weekly trend in the features for a defined risk score of confirmed SARS-CoV-2 cases, hospitalizations or mortalities. Following the sensitivity analysis, we included a shuffling of the order of the training data and developed a model structure that allowed the learning rate to decrease for subsequent epochs if an increase in the validation loss, a proxy metric for overfitting, was detected.
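A minimal Keras sketch of this training configuration: shuffled training batches and a learning rate that is reduced when the validation loss stops improving. The reduction factor, patience and initial rate are illustrative, and the fit call (commented) assumes a compiled model and prepared inputs.

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.optimizers import RMSprop

# Decrease the learning rate when an increase (or plateau) in validation loss is detected.
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2,
                              min_lr=1e-5, verbose=1)

# model.compile(optimizer=RMSprop(learning_rate=0.001),
#               loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_inputs, train_targets, validation_data=(val_inputs, val_targets),
#           epochs=40, batch_size=128, shuffle=True, callbacks=[reduce_lr])
```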

In the final layers of the SI-LSTM and SI-CNN-LSTM we introduced a connection network between all geographic locations so that the model can learn from intra- and inter-location feature weighting. We merge the 363 independent input branches by combining the list of tensors, from the final LSTM layer for each location, on a single concatenation axis to produce a single tensor, as described in Fig. 3. The final LSTM layer produces a rank-2 tensor of shape (b, u), where b is the batch size and u is the number of units in the LSTM layer. After concatenation of the tensors from the L locations, the resulting tensor has shape (b, Lu).

The final dense layer has a softmax activation function, which ensures that the output vector y over the C classes is normalized and that yi, for i ∈ {1, …, C}, can be interpreted as the probability that the target is class i. The cross-entropy loss function is then defined as:

$${{{\mathcal{L}}}}\left( {y_i,t_i} \right) = - \mathop {\sum}\limits_{i = 1}^C {t_i\log y_i}$$

where ti is the one-hot encoded target vector. We then used RMSprop as the optimization function in the back-propagation stage.

SI-LSTM

The model has an initial input layer for each location, followed by two LSTM layers with a time-distributed dropout layer, which helped to prevent overfitting in the early model epochs. There is a final LSTM layer before the model forks, as seen in Fig. 1, to produce a dense side-output layer for each location and a concatenation layer followed by a dense layer. The final output layers have a softmax activation function due to the probabilistic categorical cross-entropy loss function.
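A minimal sketch of this branch-and-concatenate design using the Keras functional API, with a reduced number of locations and illustrative layer sizes; the published architecture (Fig. 1) uses 363 branches, and the exact unit counts, dropout rates and post-concatenation dense layer are assumptions for illustration.

```python
from tensorflow.keras import Input, Model, layers
from tensorflow.keras.optimizers import RMSprop

N_LOCATIONS, LOOKBACK, N_FEATURES, N_CLASSES = 4, 7, 110, 4  # illustrative sizes

inputs, branch_ends, side_outputs = [], [], []
for i in range(N_LOCATIONS):
    inp = Input(shape=(LOOKBACK, N_FEATURES), name=f"lad_{i}_input")
    x = layers.LSTM(128, return_sequences=True)(inp)
    x = layers.TimeDistributed(layers.Dropout(0.2))(x)   # time-distributed dropout
    x = layers.LSTM(128, return_sequences=True)(x)
    x = layers.TimeDistributed(layers.Dropout(0.2))(x)
    x = layers.LSTM(128)(x)                              # final LSTM -> rank-2 tensor (b, u)
    side = layers.Dense(N_CLASSES, activation="softmax", name=f"lad_{i}_side")(x)
    inputs.append(inp); branch_ends.append(x); side_outputs.append(side)

# Geospatial connection network: concatenate the (b, u) tensors into a (b, L*u) tensor.
merged = layers.concatenate(branch_ends)
main_outputs = [layers.Dense(N_CLASSES, activation="softmax", name=f"lad_{i}_main")(merged)
                for i in range(N_LOCATIONS)]

model = Model(inputs=inputs, outputs=main_outputs + side_outputs)
model.compile(optimizer=RMSprop(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
```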

SI-CNN-LSTM

The SI-CNN-LSTM architecture takes advantage of the feature amplification ability of CNN layers, which use a type of weight sharing with local perception to refine and condense the number of parameters, helping to improve the learning efficiency of the LSTM layers44. Due to the dimensional size of the features after the one-dimensional CNN layers, a time-distributed dropout layer, a one-dimensional max pooling layer and a flatten layer are included. The model structure then includes three LSTM layers, with the first LSTM layer followed by a dropout layer and a dense layer, and a further dropout layer on the second LSTM layer. The model then branches out to a dense side-output layer and a concatenation layer before the final dense layer.
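A minimal sketch of a single SI-CNN-LSTM input branch following the layer ordering described above; the filter counts, kernel size and unit sizes are illustrative assumptions. In the Keras functional API the time-distributed convolutional front end expects an extra sub-sequence dimension, so the lookback window is assumed here to be reshaped into (sub_sequences, steps, features); the branch output and side output then feed the same concatenation layer as in the SI-LSTM sketch.

```python
from tensorflow.keras import Input, layers

SUB_SEQS, STEPS, N_FEATURES, N_CLASSES = 1, 7, 110, 4  # illustrative split of the lookback window

inp = Input(shape=(SUB_SEQS, STEPS, N_FEATURES))
x = layers.TimeDistributed(layers.Conv1D(64, kernel_size=3, padding="same",
                                         activation="relu"))(inp)   # feature amplification
x = layers.TimeDistributed(layers.Dropout(0.2))(x)
x = layers.TimeDistributed(layers.MaxPooling1D(pool_size=2))(x)
x = layers.TimeDistributed(layers.Flatten())(x)                     # condensed features per sub-sequence
x = layers.LSTM(128, return_sequences=True)(x)
x = layers.Dropout(0.2)(x)
x = layers.TimeDistributed(layers.Dense(64, activation="relu"))(x)
x = layers.LSTM(128, return_sequences=True)(x)
x = layers.Dropout(0.2)(x)
branch_output = layers.LSTM(128)(x)   # feeds the per-location side output and the concatenation layer
side_output = layers.Dense(N_CLASSES, activation="softmax")(branch_output)
```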

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.