Assessing the value of integrating national longitudinal shopping data into respiratory disease forecasting models

The COVID-19 pandemic led to unparalleled pressure on healthcare services. Improved healthcare planning in relation to diseases affecting the respiratory system has consequently become a key concern. We investigated the value of integrating sales of non-prescription medications commonly bought for managing respiratory symptoms, to improve forecasting of weekly registered deaths from respiratory disease at local levels across England, by using over 2 billion transactions logged by a UK high street retailer from March 2016 to March 2020. We report the results from the novel AI (Artificial Intelligence) explainability variable importance tool Model Class Reliance implemented on the PADRUS model (Prediction of Amount of Deaths by Respiratory disease Using Sales). PADRUS is a machine learning model optimised to predict registered deaths from respiratory disease in 314 local authority areas across England through the integration of shopping sales data and focused on purchases of non-prescription medications. We found strong evidence that models incorporating sales data significantly out-perform other models that solely use variables traditionally associated with respiratory disease (e.g. sociodemographics and weather data). Accuracy gains are highest (increases in R2 (coefficient of determination) between 0.09 to 0.11) in periods of maximum risk to the general public. Results demonstrate the potential to utilise sales data to monitor population health with information at a high level of geographic granularity.


Introduction
Between 2015 and 2019, respiratory disease was the underlying cause of 369,900 deaths across England and Wales [1] 1 .By 2020, and the arrival of the global pandemic, respiratory disease had become the leading cause of death [1], with COVID-19 listed on over 200,000 UK death certificates to date [2].With COVID-19 appearing to have established itself as a dominant and long-standing disease [2,3], and with public health services simultaneously facing increasing financial and logistical pressures, it has become more critical than ever to investigate better forecasting of Influenza-like Illnesses (ILI) and their likely impact on local and vulnerable populations.
Response to COVID-19 also served to highlight the limitations faced by classical disease modelling in supporting prediction efforts.In traditional forecasting models, individuals are assigned to just one of three states (susceptible-infected-recovered) and population behaviours are broadly assumed to be both homogeneous and static [4].Evidence from the pandemic highlighted not only the key role micro-scale interactions between individuals play in disease transmission, but the highly dynamic and evolving nature of those interactions amidst social, economic and behavioural shifts within the population, especially at local levels [5].In reaction to this, there have been renewed calls to directly integrate social and behavioural data into disease models [4,5,6,7,8,9], with the aim of producing more nuanced and adaptive forecasting mechanisms.Data that can provide insights into the evolving nature of population behaviours, from mobility patterns to nutrition and self-medication strategies, have been referred to as operational response data by Bedson et al. [4].Although such event streams hold much promise, their use has remained limited due to lack of availability, access and appropriate levels of granularity.Consequently their practical utility is broadly untested.
While researchers have awaited development of the infrastructure to support use of digital footprint data within epidemiological models, they have continued to leverage more traditional data collection methods such as surveying and the harvesting of self-reports [4,5,6,7,8].Surveying at scale, however, remains a logistically challenging endeavour, with sample sizes constrained by both budgets and capacity, along with the recognised potential for response bias [10,11], even when data is obtained directly via mobile applications [5,12].When self-reports are obtained indirectly through analysis of social media accounts they too have similar response biases due to both memory failure and social desirability [13,14].
Alternative digital footprint datasets offer potential to provide a more sustainable and passive route to monitoring longitudinal health behaviours at scale.Such data can be used to supplement and complement qualitative insights, and act at a resolution that can reflect variations within local populations if key transparency, privacy and commercial sensitivity issues can be addressed.While some datasets of this nature, such as mobility data from use of mobile devices, have already been applied to COVID-19 modelling with some success [15,16], no integrated disease models in the UK have incorporated sales data.Self-medication purchasing patterns hold key information about populationwide response to disease, yet their use in health surveillance remains scarce [17,18,19].

Using shopping data logs to examine health
Recorded and stored by retailers across the UK, transactional shopping data consist of longitudinal, time-stamped purchasing logs, specified at store-level geographical granularity.One of the key advantages of such data-sets is that they are updated in real-time, offering capacity to investigate behavioural signals not only across a population, but over time.A variety of studies have reported the potential of such data to provide insights into population health [20,21,22,23].Non-prescription medication purchase logs, in particular, have long been considered a valuable input into health surveillance systems [18,24,25], with Welliver et al. evidencing its potential to indicate influenza at community levels as far back as 1979 [26].Similarly, Hogan et al. showed a high correlation between respiratory and diarrheal outbreaks in children and sales of over-the-counter electrolyte products [27].Socan et al. showed correlation between medicines for sore throat, antitussives, decongestants and mucolytics, and weekly influenza incidence; the sales of antitussives alone could predict increased incident rates [28].
There remains, however, contention as to the effectiveness of integrating such datasets in forecasting systems, a key issue given the surrounding privacy, ethical and transparency challenges of their usage [29].Al-Tawfiq et al.'s review of health surveillance systems for emerging respiratory viruses reported only mixed results in the efficacy of monitoring of non-prescription drug purchases [30]; and while Pivette et al. systematic review of drug sales in outbreak detection demonstrated improvements in provision of earlier alerts, they also noted challenges in selecting indicator drug groups and poor-quality clinical surveillance data [17].Davies and Finch's 2003 UK study over 3 winter periods (examining Nottingham City Hospital NHS, Boots and Reckitt Benckiser pharmacy data), reported that analysis of non-prescription cough/cold medications could provide a two-week warning of respiratory admissions peaks [31], yet Todd et al. found no significant correlation between retail sales of symptom remedies and cases at a sub-national scale for the 2009 UK influenza outbreak [32].
One of the challenges in assessing this prior research lies in the spatial and temporal granularity that researchers have had available to them.Disease transmission predominantly acts at local community levels -yet many studies have only been able to examine datasets aggregated to larger geographical regions.This has restricted many studies to broad-brush conclusions.Todd et al. [32] found that national non-prescription medication sales models, fitted using national spatial areas of the countries England, Scotland and Wales, and 10 health regions of England, were not useful for identifying weekly flu cases.However, models fitted using sub-regions defined by catchment areas for 156 English Primary Care Trusts showed some statistically significant positive correlations between flu cases and sales [32].They concluded using retail sales at finer spatial scales may help augment existing surveillance and merit further study [32].In this work we are able to address some of these prior limitations, investigating whether behavioural data (in the form of non-prescription medication sales) can increase the effectiveness of forecasting models for respiratory death at a weekly resolution, and acting at the far finer geospatial granularity of local authorities.This study investigates whether shopping sales data, drawn from over 2 billion transactions logged by a national UK high street retailer, at store locations distributed across all of England covering 314 Lower Tier Local Authorities (LTLAs), between 2016 and 2020, can improve the effectiveness of models used to forecast weekly deaths by respiratory disease.
It is important to note that establishing the capability of sales data to predict respiratory deaths is, unto itself, insufficient to prove their value to integrated disease models.To warrant investment in the use of sales data in forecasting models, and the citizen privacy and commercial sensitivity issues which then must be surmounted, we must investigate the contribution that such data can offer uniquely, over and above other predictive variables.To this end, in this study we employ an experimental approach specifically designed to identify the added value provided via non-prescription medication sales data in comparison to other covariates, leveraging recent advances in variable importance analysis [33,34].In line with Hofman et al.'s recommendations on computational social science we construct baseline models prior to the addition of sales data, implementing a rigorous out-of-sample testing regime in order to ensure generalizability, with appropriate attention to both prediction and explanation tasks [35].As detailed in §4, and introduced in §2.2 and 2.5, two comparative models are created by attaching a range of independent variables (input features) for each of England's 314 LTLAs, derived from multiple datasets outside of sales data (including the week of the year, weather, temperature, indices of multiple deprivation, age/population level, demographics, housing and land use).A further weekly dependent variable (output feature) is also attached to each LTLA, reflecting respiratory deaths in the authority for that week.We refer to these comparative models as PADRUS (Prediction of Amount of Deaths by Respiratory disease Using Sales) and PADRUNOS (Prediction of Amount of Deaths by Respiratory disease Using No Sales).PADRUS has additional weekly behavioural input features created from shopping data attached to each local authority, encoding medication purchasing patterns across the LTLAs local population each week.These variables are derived from high-resolution sales data sourced from a national UK high street retailer, including both relative and absolute weekly sales levels of a variety of of non-prescription medications bought for managing respiratory symptoms.Predictive models are then trained and assessed to identify an optimal forecasting mechanism based on this dataset, before application of Model Class Reliance (MCR) analysis to ascertain variable importances.
MCR is a relatively new approach that allows researchers to examine the impact of variables not in a single model, but across all models that perform equally well, known as the 'Rashomon set' [36]).No single predictive model can be guaranteed to provide a full explanation of the underlying phenomena it is modelling -in fact, far from it, with forecasting models that produce the same levels of accuracy often employing underlying variables in significantly different ways [33,37].Such instability of explanation is due in part to the complex interactions that occur between variables and the mutual information that can be shared between them, which can serve to occlude confounding variables, and over-state the importance of correlated features [34].Only through interrogation of a range of competing explanations (i.e.well performing models) can the utility and necessity of any individual input feature be rigorously assessed.With automatic enumeration of all competing, equally valid explanations intractable in most cases, MCR provides a mechanism to find the minimum (M CR−) and maximum (M CR+) levels of usage of variables across competing explanations.This can help identify a variable's absolute requirement (M CR−) and its maximal utility (M CR+) for forecasting efforts, rather than arbitrarily selecting a value between the two from a single model randomly selected from the Rashomon set.
By comparing the accuracy of PADRUS's weekly predictions of deaths by respiratory disease in each local authority against the baseline and comparative model (PADRUNOS), and through examination of variable permutation importance bounds, computed by MCR, we investigated the extent to which improvements are dependent on the integration of sales data.

National-Level Exploratory Models
To determine forecast horizons, the length of time between when predictions are made and the actual date when the prediction is to occur, and to establish if linear relationships existed, initial models were developed using sales and outcomes across the whole of England and Wales using 3-31 day intervals between the last day of the weekly sales aggregate (sales are collected Wednesday to Tuesday) and the day of reported deaths (deaths are registered on a Friday)(see Table 1).Using over 5 million transactions from November 2009 to April 2015, weekly sales figures were totalled for England and Wales to make predictions between 7th December 2009 to 13th April 2015.Across this time period the maximum weekly deaths from provisionally registered respiratory disease (as the underlying cause) in England and Wales was 3,521 and the minimum 868, with 378,230 respiratory deaths recorded in total.This exploratory modelling used a separate data-set, covering a different time period, and available prior to acquiring the data used in the following local-level forecasting models.Input sales features were created from total weekly sales of cough, dry cough, mucus cough, decongestant and throat healthcare products, reflecting internal product categories used by the retailer and recorded through point-of-sale logging systems in stores.Regressors predicting weekly deaths from respiratory disease using sales data 17 days in advance achieved the strongest results, producing an out-of-sample R² of 0.81 on the held-out test data (30%).When predictions were made 24 days in advance, results were still strong (RMSE 224, R² 0.77), but performance was notably weaker when predicting within 10 or less days, or 31 or more days in advance.

Local-Level Forecasting Models
To examine the relationship between input variables and respiratory deaths at local authority levels (LTLAs) three models were created and are referred to as the baseline, PADRUS and PADRUNOS.All three models were non-linear, and utilised a random forest regressor (allowing for subsequent MCR analysis [36]), with parameters optimized via a time series cross-validation regime using a grid-search (see §4), and results given on held-out test data (30%).Based on the results in Table 1 for nationallevel exploratory modelling, models used features constructed based on a 17 day forecast horizon for each of the 314 LTLAs in England3 between 18th March 2016 and 27th March 2020.The baseline model was constructed from two variable inputs consisting of week number, reflecting the seasonal nature of ILIs, and LTLA population over 65 4 , reflecting the known vulnerability of this demographic to respiratory disease.For the dependent variable (output feature) the maximum weekly registered deaths in any LTLA from respiratory disease in England was 219 and the minimum 0, with deaths 671,046 assigned in total 5 .As shown in Table 2, despite its simplicity the baseline model showed reasonably high predictive power, forecasting weekly respiratory deaths in each LTLA with an R² of 0.71 (RMSE 4.00, MAE 2.78).
The PADRUS model was next constructed, utilising 56 features derived from sales, weather, demographic, environmental and socioeconomic data (please see §4.3.1 for full details of dependent variables).Sales data used was composed of units sold for a variety of self-medication products across 2,354 retail stores in England during the time period.Sales data variables were created from both the total sales from the week ending 24 days before the day of the prediction, and the total sales from the week ending 17 days before the prediction.These are two seven day non-overlapping periods.We subsequently refer to these as variables with a 17 or 24 day lag.The forecast horizon for the PADRUS model was 17 days.The PADRUS model significantly outperforms the baseline model achieving an R² of 0.78 (RMSE 3.42, MAE 2.39), producing significantly higher levels of predictive accuracy forecasting.In addition to the strong out-of-sample results, we note further the extremely high in-sample fit results, with PADRUS model producing an R² of 0.86 when modelling training data (again please refer to Table 2 for full results).

Variable importance results
As is the case for all current regression tools, PADRUS reflects a single model, against which variable importance tools such as permutation importance or SHAP can be applied.While such analyses cannot be assumed, unto themselves, to reflect underlying phenomena or causal drivers, the mechanism by which a model is functioning can remain of interest.As such, permutation feature importance analysis was applied to PADRUS (see Supplement 2 for full details).The size and age of an LTLA's population was identified as being the most important driver in producing model predictions, and this follows intuition, with mortality rates following respiratory disease being higher in older demographics.However, these factors were closely followed in importance by sales data features, and specifically relative proportion of cough medication purchases.After these top ranked features, of next greatest importance to the model's predictions were IMD concentration (Index of Multiple Deprivation) 6 and weather features, both having greater impact on forecasting than decongestant sales, and housing related features.Permutation importance, while still being informative, is limited in its explanatory power, and feature importance analysis for PADRUS was hence also examined via SHAP (SHapley Addictive exPlanations) (see Figure 1).SHAP values were calculated on a random sample of 1000 data points using the instance of the PADRUS model output by the optimization process.Results again reported that the size of the > 65 population as most important to the model's predictions, closely followed by size of younger populations in an LTLA.However, this was again followed in importance by sales data, specifically all cough, and dry cough medicine sales figures with a 24 day lag.Next was IMD concentration , cough and dry cough medicine sales with a 17 day lag, and thereafter features created from weather data.
While potentially indicative of the important role that sales data might play in respiratory disease predictions, SHAP analysis still only reflects the workings of a single arbitrary instance of the PADRUS model, drawn at random by the optimiser from the Rashomon Set of all equally well-performing models.As such, MCR was finally applied to understand the boundaries of this Rashomon set.The permutation importance bounds (MCR-, MCR+) returned by MCR (see Figure 5, and Supplement 2 for full details) indicate moderate differences in the way independent variables can be used in combination to forecast respiratory deaths.MCR indicated unequivocally that the size of populations in the three age categories remain the most important factor, and that these were irreplaceable by any other variables in making optimal predictions (with an MCR-higher than any other variable sets MCR+).Population over 65 was the most dependable feature for predictions with the highest MCR-.
While overall IMD (deprivation) scores for an LTLA had high importance in some models, this was not the case across the Rashomon set, and their relatively low MCR-scores indicated that they could be replaced by other variables (this excludes IMD concentration which consistently remains important).In contrast the features with highest MCR-scores, indicating that were most strongly required following population sizes, were features derived from cough medicine sales, with cough medicine sales with a 24 day lag outperforming a 17 day lag.Subsequent features with relatively high MCR-scores compared to other remaining variables, were IMD concentration, decongestant sales with a 17 day lag, minimum temperature, maximum temperature, average temperature, week number (weeknum), and decongestant sales with a 24 day lag.Variables besides the aforementioned showed consistently low MCR-scores, yet several retained stronger MCR+ scores which indicates that these input features were likely broadly interchangeable with each other.
When a large set of independent variables are used as inputs to a predictor, assessing the importance of each feature to achieving a models' optimal accuracy can be challenging, and results are difficult to interpret in unison.To address this we applied Group-MCR, which computes the importance of a category of features in concert, calculating the minimum and maximum impact variable groups could have on the predictions across all instances of the model [37].Results for Group-MCR are reported in Figure 2. Features related to population age are substantially the most important group of features (MCR-22.84,MCR+ 26.05).When grouped, sales features, however, are markedly second in importance to achieving optimal predictions (MCR-11.99,MCR+ 12.87).Following local population demographics, non-prescription medication sales are most central to models producing optimal 17-day forecasts of respiratory deaths in local authority regions.).The high MCR-/+ for grouped sales features means that there are no models within the Rashomon set that can achieve their predictive scores (R 2 0.78) without this set of variables (Figure 2).

Comparison to models without Sales Data
MCR analysis indicates that the prediction accuracy produced by PADRUS could not be achieved without the inclusion of sales data variables.However, MCR in its current form, cannot tell us the exact accuracy increases gained through their inclusion -only that they are essential to those gains made.To compute the difference in accuracy, a further model PADRUNOS was used, constructed from the same variable set as PADRUS but omitting features derived from sales data.To derive the PADRUNOS model, a random forest regressor was again optimised using a time series cross-validation grid-search to predict weekly deaths from respiratory disease 17 days in advance for the 314 LTLAs.
Results were notable in that the PADRUNOS model achieved a strong R² of 0.76 (RMSE 3.61, MAE 2.49) on the 30% held-out test dataset (20,410 data points), despite the absence of self-medication sales (see Table 2 for full details).While the improvements made by PADRUS over PADRUNOS detailed in Table 1 were significant across all treatments (p-value < 0.001, using Bonferroni corrected Wilcoxon signed-rank tests7 ), accuracy increases were relatively small (0.02 R²).Group-MCR was again applied to investigate how PADRUNOS was compensating for the loss of sales data (see Figure 2

Assessing critical prediction periods and suppression
We observed that optimal forecasting models are dependent on sales data; yet broader analysis appeared to indicate that the accuracy gains arriving through their inclusion are relatively small.To understand this apparent contradiction, and explain how sales data can simultaneously appear to be both critical to predictions while only providing limited improvements in accuracy, we examined weekly time-series forecasting results across LTLAs for both models (plotted in Figure 3).Forecasting graphs illustrate that although both PADRUS and PADRUNOS follow similar predictive patterns between 2016 and 2020, PADRUS is better at capturing key spikes in deaths rates that are underestimated by PADRUNOS.It is here that inclusion of sales data obtains accuracy gains, over and above reliance on seasonal indicators and weather-related variables.Comparing PADRUS and PADRUNOS prediction scores (R², MAE and RMSE) by week (Figure 6, Supplement 3) further highlights how the inclusion of sales data increases the PADRUS models predictive accuracy especially in winter months.Accuracy gains are hence limited over the whole year, but they do occur in periods most important to early warning systems.
A second key observation is obtained through examination of how results vary across individual LTLA areas (see Supplement 4 for more details).These indicate that prediction accuracies are being impacted by inclusion of suppressed data. 8When predictions were limited to the 41 LTLA areas with data that contained no suppressed weeks, it became clear that the PADRUS model was not only performing significantly better than PADRUNOS, but with far clearer effect sizes.For unsuppressed LTLA data, PADRUS achieved an R 2 of 0.70 (RMSE 5.87, MAE 4.37) compared to PADRUNOS' R 2 of 0.63 (RMSE 6.50, MAE 4.80).As shown in Table 2, this reflects an improvement in R 2 of 0.07, indicating that suppression was simplifying the true prediction task, to the advantage of models without behavioural sales data.
Further, when analysis of predictions in these LTLAs is constrained to key periods for respiratory disease (e.g.November -February) the improvement in performance of the PADRUS model (R 2 0.69, RMSE 6.13, MAE 4.83 ) further increases in comparison to models without sales data (R 2 0.58, RMSE 7.07, MAE 5.52).Winter periods, when vulnerability to respiratory disease is most prevalent, reflect weeks when suppression is less likely but also when regressors have the most scope for error due to higher potential incidence.Depending on the exact start and end month chosen to capture these critical periods for forecasting, inclusion of sales data increases R 2 between 0.09 to 0.11 (again full results are detailed in Table 2).

Discussion
Results of this study evidence that the inclusion of non-prescription medication sales data commonly bought for managing respiratory symptoms, alongside traditional variables, can contribute to increases in forecasting accuracy for respiratory deaths.Models corroborate prior work [20,21,22,23], but also provide potential insights into why contradictory results exist within the literature [30,17,32].While initial results in this study achieved only relatively small effect sizes in across-the-year accuracy gains (with an increase of 0.02 R 2 over models without sales data), further analysis revealed that improvements were being constrained by over-simplification of the modelling task due to suppressed data, and a lack of focus on periods when incidence of respiratory disease is the most volatile.When these issues were attended to, the gains made through inclusion of sales increased markedly, with out-of-sample forecasting increasing in predictive performance by 0.11 (R 2 ) when models included behavioural sales data.
In addition to confirming the influence of age and population size, variable importance analyses (in the form of MCR/Group-MCR) showed that optimal predictions were achieved using medication sales data, notably the cough sales medication feature with a 24 day lag (with IMD concentration and temperature also providing utility).The importance of cough sales recorded 24 days in advance shows the potential for predictions from sales data to have forecast horizons long enough in length to inform the forward planning needed for allocating health resources.Why sales features with 24 day lag appear to have a higher variable importance to the PADRUS model than those with a 17 day lag needs further investigation, however the length of this duration indicates that lagged medication sales may not be predicting hospitalisation in the people that purchase them -but of a growing risk within a community.This risk could be an increase in infectious diseases affecting the respiratory system, or another environmental change which is exacerbating and/or increasing chronic lung conditions.Confirmation of this hypothesis, however, would require accompanying individual level studies.
This study considered four continual years of weekly retail sales data in 314 LTLAs across England, providing finer resolution evidence for previously hypothesised relationships between 1. sales of cough and decongestant medications and rates of ILI [28] and 2. sales of cough medicines and peaks in hospital admissions for respiratory illness two weeks later [31].No significant evidence for the relevance of other non-prescription medications, such as throat remedies, was found.Due to the temporal granularity of the dataset made available to this work, it was also possible to examine a range of the models results constrained to particular periods across the year (see Table 1 for a full list of these).PADRUS provided significant improvements over PADRUNOS (and accounting for Bonferroni corrections) but sales data proved most valuable to the periods covering October to February.This is likely due to both the increased magnitude of deaths over this period, and hence the higher potential volatility of mean squared error that can occur -periods that forecasting models are of most utility to decision makers.
MCR analysis suggests that the gains occurring due to inclusion of sales data are due to weather/ temperature variables being unable to sufficiently compensate for when disease incidence veers away from seasonal norms.A relationship between temperature and ILIs is well noted in the literature, but there is no current consensus on how colder weather might cause the seasonal increase in respiratory disease [38].Moreover, when disease incidence diverges from historical patterns due to feature drift, in this use-case a rapid change in the relationship between seasonable variables and deaths, predictive models are unable to respond quickly enough.An example is the 2017/18 influenza season in the UK, which was unexpectedly characterised by co-circulation of both influenza A (H3N2) and influenza B. Vaccination regimes were unable to respond accordingly to this situation, leading to a significantly large number of respiratory outbreaks and increased hospital admissions [39].In PADRUNOS, seasonal/temperature variables were unable to adapt to this change in causal relationship (or 'feature drift'), whereas in PADRUS the relationship between non-prescription medication sales and deaths remained far more stable.For disease surveillance under adaptation, variables that respond to disease prevalence, and obtained from direct empirical observation of human behaviour, may provide a more stable and reactive forecasting mechanism.
Predictive mechanisms able to adapt to changing circumstances are, of course, also highly relevant to the surveillance of pandemics such as COVID-19, where new unpredictable disease occur due to such factors as antigenic shifts.While analysis of this period was outside the scope of the data available for this work, a small crossover period does occur (up to March 27th).At this point the PADRUS model predicts that deaths from respiratory disease were set to increase, a response to increased nonprescription medication sales it observes, whereas PADRUNOS model, relying on seasonal variables, predicts a fall.There is insufficient data covering this period, however, to test whether this forecasting advantage would continue -with reported panic buying of medical supplies and a shift into online shopping, feature drift may have weakened the relationship between sales and hospitalisations.However with COVID-19 still prevalent worldwide and continuing to regularly defy the normal seasonal trends of ILIs, further research is required to investigate if the potential holds to underpin early warning mechanisms.
This study is limited by its use of in-store purchases from one UK high street retailer despite the stores wide coverage across England.While the use of a single lens into medication sales does not impact the validity of forecasting results, and indeed likely means that reported forecasting accuracies only represent a lower bound of potential forecasting improvements, representativeness of results are difficult to examine.Any real-world forecasting model based upon a single source of data is necessarily biased towards the demographic it featured most heavily.Furthermore, areas of the country where a retailer has greatest presence (which in this case tend to be urban, rather than rural locales) will feature most predominantly within a regression, and consequently receive higher probabilities of accurate predictions.This may be attended to by normalisation/weighting towards representative areas/customer types, but further investigation is required.
We also note that an anomaly in the data occurred on the 6th January 2017 where registered deaths were very low for the time period.Public holidays affect the process of death registrations [40], and the anomaly may have occurred due to Christmas Day in 2016 occurring over a weekend.A final limitation, and a potentially fruitful route for further research, lies in the need to undertake data linkage between individual customer data and clinical outcomes if we are to unpack drivers of model accuracy.This would require a participatory cohort study of shoppers willing to undertake data donation, and careful sampling across demographics, geospatial regions and of those who have/have not experienced respiratory disease.
This study provides evidence of the potential value in integrating national longitudinal shopping data into respiratory disease forecasting models acting at local levels.Analysis indicates value in including sales data in our models in order to identify irregularities outside seasonal norms, and in periods where deaths from respiratory disease were most prevalent and varied.With COVID-19 now endemic in the population the potential of shopping data to improve forecasting is even more valuable for aiding healthcare decision-making.The prospect of increasing the capability of sales data to improve forecasting models for respiratory disease is high; the PADRUS model is at the lower bounds of possible predictive accuracy with an array of limitations and improvements that future research could address.

Shopping, Mortality and Socio-demographic data
Data was collected for this study through working partnerships with: the UK National Health Service (NHS); a UK commercial high street retailer; and via open source data providers.Data sources and descriptions used in this study can be seen in Supplement 1, which lists the data used for both dependent and independent variables.Data underpinning medication sales variables originates from two data-sets made available to the project by a national UK high street retailer, incorporating: 1. a 20% sample of timestamped commercial sales transactions (including over 5 million transactions of non-prescription medications for cough, decongestant and throat) from England and Wales recorded through 2,702,449 loyalty cards from November 2009 to April 2015; and 2. a data-set of all in-store weekly commercial sales transactions in England (over 2 billion) labelled by store location, covering the period between March 2016 to March 2020.Commercial sales data was sales units of all in-store transactions with store location only; no sales transactions were linked to individual customers, and no personal data was used in this study.Other independent variables used in the models are derived from: English indices of deprivation 2019 [41]; Nomis official census and labour market statistics [42]; Housing age data from the Valuation Office Agency 2020 [43]; ONS Census 2011 [44]; Land use in England from 2018 live tables from the Department for Levelling Up, Housing and Communities and Ministry of Communities and Local Government [45]; and Weather ERA5 data from the European Centre for Medium-Range Weather Forecasts [46].
Data was collected or aggregated to the geographic regions of LTLAs (Local Tier Local Authorities) 9 .LTLAs were used because, as geographic areas used by government data, they would enable direct inclusion of a wide range of demographic, environmental and socioeconomic data previously found to be significant to the risk of death from respiratory disease (Supplement 1).LTLAs were chosen over other spatial divisions, including census areas MSOAs (Middle Layer Super Output Areas) and LSOAs (Lower Layer Super Output Areas), to find a balance between limiting data suppression of deaths done to maintain data de-identification, and maintaining a high enough level of spatial granularity for data to have a significant impact on predictions.
The output variable for the models, respiratory deaths in each of the England's 314 Local Authorities (LTLAs), originates from two datasets supplied by the UK Office of National Statistics (ONS) and the National Commissioning Data Repository (NDCR) containing: 1. details of total weekly registered deaths from respiratory disease (as the underlying cause)(ICD 10 coding: J00 -J99) in England and Wales from 7th December 2009 to 13th April 2015 [47]; and 2. ONS data on all weekly registered deaths from respiratory disease (ICD 10 coding: J00 -J99) by 314 LTLAs in England from 18th March 2016 to 27th March 2020.The health data in this study was used under the terms of usual practice for research as defined in the UK Policy Framework for Health and Social Care Research, with the study designed to investigate the health issues in a population to improve population health [48].To keep health data de-identified and extractable from the NDCR, death counts below 5 within an LTLA were suppressed and reported as 5 prior to receipt.Suppression was between 20% and 25%.
Open source data availability for UK LTLAs outside of England was limited, and consequently analysis was for local-level forecasting was restricted to 314 LTLAs in England alone.Even with this restriction other variables under consideration for inclusion such as search engine trends and temporal pollen, traffic, and air pollution counts, could not be included as there was no access to these datasets at this level of geospatial granularity within the time limits of the project.Data matching, the time series of available target predictions and available dynamic datasets of sales and weather, gave the dataset its time-frame of the 18th March 2016 to 27th March 2020 for weekly deaths by respiratory disease.

Exploratory modelling / National-level Analysis
Preliminary exploration of the national UK relationship between sales data and respiratory deaths was undertaken before modelling at local levels to better understand potential forecast horizons.For this analysis a distinct preceding time period was considered, using ONS data on weekly registered deaths from respiratory disease (as the underlying cause) in England and Wales from 7th December 2009 to 13th April 2015 [47] and medication sales data drawn with forecast horizons of 3, 10, 17, 24, 31 days (forecast horizon refers to the time lapse between the date of registered deaths being predicted and the date of reported sales).Independent variables consisted of 'week number' (1 to 52) and features created from commercial sales data, including absolute weekly sales units for: cough medication, dry cough, mucus cough, decongestant and throat healthcare products, reflecting the data partner's internal product categories.Sales features were aggregated by week from over 5 million over-the-counter transactions recorded by 2,702,449 loyalty cards.Data was temporally stratified into a training (70%) and test set (30%), and out-of-sample performance examined against the held-out test data via RMSE (root mean squared error) and R² (coefficient of determination).Several linear regression models were trained (one for each forecast horizon at which sales data was engineered) and / Community 3,000m).These catchment ranges reflect 2σ (95% of the kernel) and were devised in collaboration with the data provider.The influence a store, s, had over any point, x, can therefore be assigned as: The 'influence' the store had over an LTLA is then determined by examining the value of the store's probability density function at that LTLA's population weighted centroid 10 .The 'total influence' of each store can then be calculated by summing the influences for the population weighted centroid of every LTLA, L, from the 314 available, L: The amount of sales that each store attributes to a given LTLA, L, can then simply be assigned as a proportion of the influence the store has over that LTLA compared to its total influence overall.If sales are then aggregated for each store then we can finally produce an estimate of category sales in that LTLA overall.Formally for any sales category, c, the absolute sales assigned to an LTLA, L, in particularly week, after aggregating all contributions of every store, s, is: Total sales in any given week, however, does not always represent an informative variable.This is due to the uneven geographical distribution of the retailer's stores across the country, and the different number of customer bases they serve.A more informative variable is the percentage of sales that a category, c, is contributing that week -in comparison to sales across all product categories, C. To this end, we derive a local sales ratio variable for an LSOA, L, as follows: Here C represents all product categories sold by the retailer (including but not restricted to nonprescription medication categories) and we are therefore normalising by overall sales in that LSOA.This prevents modelling bias towards regions where there is a higher population or number of retailer stores.
It is useful to compare the sales ratio occurring in an LTLA with the national average, as this allows to defend against both feature drift and seasonal fluctuations.To achieve this we define a local sales multiplier variable, which allows us to assess whether non-prescription medication sales are proportionally more dominant in one region compared with another (making this feature more invariant over time): These three features total sales, sales ratio and local sales multiplier make up the central sales variables within our models (with versions for cough medicines, dry cough medicines, decongestant medicines and throat medicines -products selected due to reports in prior literature associating a rise in their sales with an increase in respiratory illnesses [28,31]).

Modelling Procedure
The analysis of registered deaths from respiratory disease was framed as a regression task with the amount of deaths predicted (the target) a continuous output integer variable, y.The modelling task was prediction of weekly respiratory deaths 17 days in advance for each of the 314 LTLA areas in England from 18th March 2016 to 27th March 2020.For all models (PADRUS, PADRUNOS, baseline) data was temporally stratified into the same training set (45,844 data points from 18th March 2016 to 28th December 2018), approx.70%) and test set (20,410 data points from 4th January 2019 to 27th March 2020, approx.30%) Random forest regressors were trained and evaluated, with meta-parameters for the model optimised using a time series cross-validation (TSCV, 4 splits) and grid search to prevent over-fitting.TSCV is an alternative to k-fold cross-validation, which is applied to prevent temporal data leakage and the risk of over-reported accuracy results.TSCV iteratively applies a 'walk-forward' testing process, in order to robust evaluate a models' ability to generalise.Once optimal meta-parameters had been identified, the model was re-fit to the training dataset, and evaluated against the held-out test set.Three metrics were used to score the regression task: RMSE (root mean squared error), MAE (mean absolute error) and R² (coefficient of determination).
To assess whether the PADRUS model created was valuable, two baseline models were created.The first, baseline, was created using the same methodology; a random forest regressor optimized using a time series cross-validation grid-search on training data.The baseline performed the same task of predicting weekly respiratory deaths 17 days in advance for each of the 314 LTLA areas in England from 18th March 2016 to 27th March 2020.Data was stratified using the same splits on the training and test data (45,844 training data points, 20,410 test data points), and model predictions on the held out test data were scored using RMSE, MAE and R².The only difference between PADRUS and the baseline model is its feature inputs, with the baseline inputs consisted of only: the week number (1 to 52) for a dynamic input, and secondly the static input of the population of over 65s in each LTLA.The first feature was chosen because of the known seasonality of deaths by respiratory disease [38], yet used alone the prediction accuracy levels were very low.Therefore the second feature was chosen due to the assumption that the size and age of population would greatly influence the number of deaths, with 90% of deaths from respiratory disease in Europe occurring in those aged 65 and over [49].The second comparative model, PADRUNOS, was formed in exactly the same way, and used the exact same features as PADRUS except any that related to sales data (see §4.3.5)

Variable Importance Analyses
In order to understand the impact of the different features driving predictions in the PADRUS model, we conducted a range variable importance analyses.First standard permutation importance was used to score of each feature in producing the model's predictions, relaying the expected increase in error after the a features contents are randomly shuffled.Second, a SHAP (SHapley Additive exPlanations) Analysis [50] was conducted.SHAP carries out a more complex calculation of feature importance, with the Kernel Explainer SHAP tool used in this analysis applying weighted linear regression to determine the importance of each feature based on Shapley values, as derived from game theory [50].Both tools were applied to the training data set on an arbitrary instance of the PADRUS model.The SHAP analysis is computationally expensive approach and therefore was run on two random samples of data points of 100 and 1000.This enabled a comparison of results between sample sizes to ensure a large enough sample had been evaluated.
By running these variable importance tools on one arbitrary model, misleading results can be given due to the stochastic nature of the models machine learning algorithms.This is a problem because different instances of an optimised model class can use different variables (features) and in different ways to achieve the same model predictive performance.To address the problem of standard variable importance tools only evaluating one instance of the PADRUS model, MCR (Model Class Reliance) was applied.MCR was developed by Fisher et al. to compute the feature importance bounds across all optimal models called the Rashomon set for Kernel (SVM) Regression (polynomial run-time) [33].Smith, Mansilla and Goulding introduced a new technique that extends the computation of MCR to Random Forest classifiers and regressors [34].MCR builds on permutation for a single model, computing the permutation feature importance bounds (MCR-, MCR+) for an input variable across all instances of PADRUS; calculating the minimum and maximum impact a variable could have on the predictions across all instances of the model (See figure 4).This initial MCR analysis evaluated each feature individually for its importance, for example the importance of 'minimum temperature'.It did not assess how important a dataset type used to create a number of the models' features was to predictions as a whole, for example 'weather'.Ljevar et al. introduced a Grouped Feature approach to MCR for random forest [37].Group-MCR was created in order to calculate the effects of variable groups, measuring the importance of a collection of features together on the predictions of random forest classifiers [37].In order to evaluate the importance of groups of variables, and their impact in concert on the PADRUS model, we apply for the first time Group-MCR to a random forest regressor.Group-MCR is achieved through a modification of the random forest MCR algorithm, which reconsiders the definition of Model Reliance to be with respect to a group of variables rather than a single variable [37].

Modelling without sales data
Although MCR can explain which variables need to be included in a model to achieve the maximum accuracy rates for predictions, it can not compute the difference in accuracy (the loss) if those variables were to be excluded from the model.In order to determine the loss in accuracy if sales data were to be left out of the model, the model PADRUNOS (Predicting the amount of deaths from respiratory disease using no sales) was created as a comparison to PADRUS.PADRUNOS was created inline with the baseline and PADRUS model methodology, as a random forest regressor optimized using a time series cross-validation grid-search on training data.MCR and Group-MCR was applied to PADRUNOS to calculate how the variables were used differently.The three metrics used to score the regression task were compared for both models PADRUS and PADRUNOS.These scores were examined overall, by week, by LTLA and by winter months.Wilcoxon signed-rank tests (two-sided) were used to compare the significance of model's predictive accuracy scores.

Data Availability
The health data used in this study is not publicly available but can be requested via NHS England and Improvement NCDR [51].The shopping dataset used in this study is commercially sensitive and therefore not available for access.All other datasets are open source and can be accessed via the website links given in Supplement 1.

Code Availability
The analysis code developed for this paper can be found online at https://github.com/nhsx/commercialdata-healthcare-predictions

Figure 1 :
Figure 1: SHAP values for PADRUS Features for a side-byside comparison with PADRUS).Results show that the model became even more reliant on the grouped variables of age (MCR-37.91,MCR+ 41.21).The dynamic variable, weather (MCR-7.55,MCR+ 7.54) increased in its relevance, becoming the second most important feature used for forecasting.The related variable, week number (MCR-2.04,MCR+ 2.15) which encodes seasonality also increased its importance.All other grouped variables increased their minimum and maximum permutation bound scores, apart from land use (MCR-1.15,MCR+ 3.85) where the MCR-decreased.

Figure 3 :
Figure 3: Line Plot of PADRUS and PADRUNOS predictions and actual number of respiratory deaths.Lower panels provide magnified visualisation of the proximity of the blue area outlines (predictions) to orange areas (targets) in PADRUS and PADRUNOS at specific peak and trough periods.

Figure 4 :Figure 5 :
Figure 4: Diagrammatic representation of the difference between other variable importance tools and MCR

Figure 6 :
Figure 6: Line Plots showing how the inclusion of sales in the PADRUS model changes R², MAE and RMSE predictive scores especially in the winter months.(Three data points have been left out where error scores were higher to create a high granular view to compare models.These data points were the anomaly 2017 week 1, 2020 week 1, and 27th March 2020 the first week of the UK lockdown)

Table 2 :
Results from experimental models predicting registered deaths from respiratory disease 17 days in advance using over 2 billion in-store sales transactions with store location from March 2016 to March 2020.