Tesco Grocery 1.0, a large-scale dataset of grocery purchases in London

We present the Tesco Grocery 1.0 dataset: a record of 420 M food items purchased by 1.6 M fidelity card owners who shopped at the 411 Tesco stores in Greater London over the course of the entire year of 2015, aggregated at the level of census areas to preserve anonymity. For each area, we report the number of transactions and nutritional properties of the typical food item bought including the average caloric intake and the composition of nutrients. The set of global trade international numbers (barcodes) for each food type is also included. To establish data validity we: i) compare food purchase volumes to population from census to assess representativeness, and ii) match nutrient and energy intake to official statistics of food-related illnesses to appraise the extent to which the dataset is ecologically valid. Given its unprecedented scale and geographic granularity, the data can be used to link food purchases to a number of geographically-salient indicators, which enables studies on health outcomes, cultural aspects, and economic factors.


Background & Summary
Tesco is a British multinational grocery and general merchandise retailer. In 2015, it was 9th highest-grossing retailer in the world, with 81B in global revenue 1 and the biggest grocery retailer in UK, with 28% of market share 2 . Tesco operates a loyalty scheme where customers apply for a Clubcard that is used for both in-store and online purchases to accumulate points that can be later spent to redeem prizes or discount vouchers. With the customer consent, the record of their purchases is archived and anonymously linked to their Clubcard number. In this paper, we focus on the in-store purchases done in the 411 Tesco shops within the boundaries of Greater London during the entire year of 2015. We present aggregated and privacy-preserving data views that combine individual purchases at different spatial granularities, from Lower Super Output Areas (containing around 2,000 residents each, on average) to Boroughs (more than 250k residents, on average).
Despite the importance of studying food consumption at scale, there is little data about what people actually eat over long periods of time. The fine-grained geographical information included in Tesco Grocery 1.0 is the key to link food consumption data of an entire city to any attribute that can be measured at the level of statistical census areas. These include cultural aspects (ethnicity 3 , migration 4,5 ), societal aspects (youth alcohol use 6 ), economic factors (deprivation 7 , inequality 8 ), health determinants (medical prescriptions 9 , health awareness and daily habits 10,11 ), and social media discourse (textual 12 or visual 13 descriptors of geo-referenced posts).
Several studies mined grocery sales data (which has not been made publicly available) to, for example, build recommender systems that are able to suggest what people might like based on their past purchases [14][15][16] , or establish whether healthy foods tend to be pricey 17 and, ultimately, whether their purchase tends to be mediated by price sensitivity. The nutrient composition of ready-made meals was studied 18 , yet that represented potential availability of nutrients given the lack of sales data. Only recently, Instacart-a company that delivers groceries from local stores-published a dataset of 3 million grocery orders from more than 200,000 users 19 , yet neither geo-location information nor nutritional information for these orders is available.
Web data has been used to study the relationship between food and health. Food-related online communities have made it possible to study dietary patterns across entire countries [20][21][22] , and determine how these patterns relate to local cultures 3 . Websites containing food recipes 23 have been mined to: study the complex relationships among different ingredients; 24,25 quantify a recipe's healthiness 26,27 upon which health-conscious food recommender systems were then built; 28,29 and relate potential food consumption to health outcomes 30 .
A particular class of Web data from which food and nutrition information has been extensively extracted is that of social media data: 31 from geo-referenced tweets, food mentions were extracted, and corresponding caloric values were found to correlate with state-wide obesity rates; 12 from images of food, researchers were able to extract the portrayed food items 32,33 and even their ingredients; 34 and from geo-referenced images, researchers were able to study the relationship between food mentions and socio-economic conditions 35,36 . These datasets are publicly available, yet they do not reflect sales data: Web data suffer from a number of self-presentation and self-selection biases, which might yield a distorted picture of actual food consumption patterns 37,38 . To partly fix that, we recently analyzed the eating habits of Londoners based on Tesco data 39 , a richer version of which we now make publicly available.

Methods
Next, we detail how the data is collected and aggregated (high-level sketch in Fig. 1). purchase record collection. Given the use of Clubcards, the information about the purchase of each individual product is stored in the following anonymized form: {customer area, GTIN, timestamp}, where the area is where the customer lives, and the GTIN is the Global Trade Item Number, which is used by companies to uniquely identify their trade items globally. The purchase record is then joined with a product database that is maintained by Tesco and it is populated with the information that the producers of the food items provide. The facts that are printed on the nutrition label 40 are: {total energy, net weight, fats, saturated fats, carbohydrates, free sugars, proteins, fibers}. Only for drinks, their volume and relative volume of alcohol is reported.
The total energy is expressed in kilocalories (kcal). The nutrients are expressed in grams (g) and they represent the total weight of that nutrient in the product. The weight of fat comprises that of saturated fats and the weight of carbohydrates comprises that of sugar. We compute the total weight of alcohol by multiplying the total volume by the relative volume of alcohol, and by the density of alcohol. For example, there are 75 grams of alcohol in a 750 ml bottle of wine with 12.5% alcoholic volume ( ) ml g 750 0 125 0 8 75 To obtain the calorie intake at the nutrient level, we use the conversion factors set by EU directive 90/496/EEC 41 , which is implemented by all EU countries and is widely adopted in nutrition studies 42 . That is, we map grams into corresponding calories by Fig. 1 The Tesco Grocery 1.0 data collection process. We obtain the nutritional properties of an area's typical product by: (i) considering the purchases made with fidelity cards at tills (raw data); (ii) mapping those purchases to their nutritional properties (as per product labels); and (iii) collating the nutritional properties at area-level (based on the areas the fidelity cards were sent to), and averaging those properties out. This results in the nutritional properties of each area's typical product.
Geographic areas and census statistics. We aggregate individual purchase records at area-level using a variety of geographic resolutions: LSOA (Lower Super Output Area), MSOA (Medium Super Output Area), Ward, and LA (Local Authority or, more informally, Borough).
The choice of these specific aggregation levels is motivated by three main factors. First, they are spatial partitions defined by the Office for National Statistics (ONS-https://www.ons.gov.uk). Many publicly-available official statistics are provided at the level of these areas and, consequently, most geographic studies on UK data use them as reference geographic units. Second, the ONS provides the mapping between different levels of spatial aggregations 43 , which facilitates the aggregation of our raw data to coarser partitions. The mapping is always exact except at Ward level, where some higher-granularity areas might lie across the boundaries of multiple Wards. The ONS applies standard guidelines to provide a best-fitting match. Finally, aggregating the data at multiple granularities will enable a wider range of studies that might benefit from having either a high number of smaller areas containing fewer datapoints (but still big enough to preserve anonymity); or a lower number of areas characterized by more robust statistics. In the dataset, for each area, we report a number of basic census statistics collected by the ONS 44 in 2015 including: population, number of males, number of females, number residents aged [0][1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17], number residents aged , number residents aged 65+, average age, surface area, density of residents. Table 1 summarizes the average of some of these statistics. aggregation of food purchases to areas. Given a purchase done by a customer living in a given area a, we map the purchase to that area. The same aggregation procedure is applied for all geographical levels, so we use the generic term area to denote census areas at any level, from LSOA to Borough. For each area, we provide 3 sets of variables, described below.
penetration. The first group of variables expresses the Tesco penetration in an area. For each area a, we report the total number of products purchased (quantity (a)). To give an indication of how the customer base is representative of the overall population in the area, we compute the ratio between the number of unique customers and the number of residents, as recorded by the census: In the dataset, we provide the min-max normalized versions of these values, which can be used to filter out areas whose user base differs the most from the census statistics (see Technical Validation Section): We also report the number of days with at least one purchase (purchase-dates(a) ∈ [0; 365]). To provide an estimate of how often residents of area a go shopping, we supply the collective sum of days any clubcard in area a has been used (man-day(a)).
Nutrients. The second group captures the nutritional properties of the food purchased. Because the number of people consuming the food purchased with a Clubcard is unknown (e.g., singles vs. families), we cannot produce reliable estimates of per-capita food purchases. Therefore, we describe each area in terms of its typical product, whose nutritional values are the average over all the food products bought by the area residents.
We first report the typical food product's weight (not including drinks) and the typical drink product's volume: www.nature.com/scientificdata www.nature.com/scientificdata/ = ∑ ∈ weight a grams p where P a is the set of all food products purchased by the residents of area a, and p is one of such products. The energy intake of food is measured in calories. To capture the energy intake, we compute the total calories contained in the typical product (i.e., the average number of calories across all products purchased by area residents): where kcal(p) is the value of kilocalories in p.
Given two foods with the same amount of calories but different energy concentrations, the delivery of pleasure within people's brains is quicker for the calorie-dense food. To capture the density of calories rather than simple calorie counts, we compute: p P p P a a which reflects the concentration of calories in the area's typical product. Not all calories are created equal though: the food contains different types of nutrients that the body processes in different ways to produce energy and extract structural material for its growth and maintenance. The food nutrients we consider are: fats, saturated fats, carbohydrates (carbs), sugar, proteins, fibers, and alcohol. For each area, we compute the grams of each individual nutrient contained in the typical product: where grams(nutrient i (p)) is the grams of nutrient i in product p. We calculate the typical product's energy content given by each of these nutrients as: where kcal(nutrient i , p) is the energy intake given by that nutrient in product p.
Going beyond individual nutrients, one could study their composite impact. So, we capture the diversity of nutrients contained in the typical product. This is computed as the Shannon entropy H of the distribution of the calories given by all the nutrients: where category i (p) is an indicator function set to 1, if product p belongs to category i; 0 otherwise. We then calculate the relative weight of products belonging to any category compared to the total weight (this is done only for food products, not for drinks). We also calculate the entropy (and, like for nutrients, normalized entropy) of these relative weights.
category weight j category weight category weight 2 j j Biases and limitations. Representativeness. The sample of people whose purchases are reflected in this dataset is very large but not random, as it is a set of self-selected people who decided to shop at Tesco and to opt in for a Clubcard subscription. Therefore, the set of Clubcard owners considered is not representative of the overall population in terms of demographics, socio-economic factors, or spatial distribution. However, we provide guidelines on how to filter the data to increase robustness (see Technical Validation Section).
Coverage. As we will show in the Technical Validation section, the concentration of Tesco stores is higher in the northern part of London. As a result, some areas of the city exhibit low penetration. In the dataset, we provide the information to identify low-coverage areas and filter them out if needed.
Limited scope. The number of purchases considered is very large but by no means covers the full scope of food consumption habits of London residents. Naturally, this dataset does not reflect food consumption in restaurants. Also, it does not cover the grocery store purchases done in other chains or stores and, even among the Tesco customers, it does not include information of online purchases and of products purchased by people who do not own a Clubcard. In short, albeit this dataset reveals trends in food consumption habits at area level, it does not represent the full picture of daily food consumption.
Average product. From our data, there is no way of estimating what is the diet of individual customers. Therefore, the nutrient values are provided as averages over all the items purchased by the residents of an area.
In other words, we represent the nutritional features of the hypothetical average product consumed in the area. This type of representation is appropriate for some types of analysis but introduces limitations to any study that requires an average representation at the level of individuals, rather than at the level of geographical area.

Data Records
We made the dataset available through Figshare

Food items and their categories.
We provide the list of products contained in each food category. The identifiers associated with products can be used to query external services to get additional information about them. This information is stored in a single .csv file named food_categories.csv.

Technical Validation
This dataset is reliable when considering a variety of aspects: it reliably records all the purchases made at the supermarket tills when using fidelity cards; reports the nutritional information of individual products as accurately as the labels of the original products do; and comes with verified customer locations (they are the areas to which the fidelity cards were sent to). Yet, there are two aspects that need validation: i) the representativeness of the Tesco customer base; and ii) the ecological validity of the aggregated grocery data, namely how well they represent the general patterns of food consumption at area-level, which we estimate through food-related health outcomes in those areas.
Validation of area representativeness. In each LSOA, we consider two quantities: number of customers and number of residents. Their frequency distributions differ (Fig. 2): that of the number of residents is bell-shaped (as LSOAs have been designed to partition a city in patches containing a comparable number of residents), while that of the number of customers is skewed (not least because, as Fig. 3 shows, Tesco stores are not uniformly distributed across the city). Therefore, not only penetration rates of stores are distributed unevenly across the city, but also those of purchases are. As a result, in certain areas, the customers are not representative of the general population.
To fine-tune the desired level of general population's representativeness, researchers can consider not all areas but only part of them. To allow them to do so, we report each area's representa tiveness(a) (the ratio between the number of customers and the number of residents). The higher it is, the more representative that area's data is of the general population. We find that, if one considers only the areas with at least 0.1 representativeness (x-axis in Fig. 4 (left)), one is left with 80% of the total number of areas (x-axis). That is, at least 10% of the residents are www.nature.com/scientificdata www.nature.com/scientificdata/ customers in each area. This is a conservative estimate as it assumes that each fidelity card is used to purchase food for one single person, while it might well be used to satisfy the needs of an entire family.
Then, if one were to take that top 80% of the areas ranked by representativeness, the overall correlation between the number of customers and the number of residents would be considerable: it goes from 0.38 for LSOAs to 0.58 for wards (Fig. 4 (right)). If one were to take only the 10% most representative areas instead, then that correlation would be as high as 0.75 for LSOAs to 0.84 for wards (Fig. 4 (right)).
More generally, the larger the set of considered areas, the lower that correlation: depending on their needs, researchers have to balance the number of total areas they wish to consider, and the correlation between the number of customers and that or residents they wish to obtain. For example, considering 80% of the areas yields correlations between 0.38 (LSOAs) and 0.58 (wards).
Validation of health outcomes. We assess the ecological validity of the data by comparing the grocery purchases with metabolic syndrome conditions that are strongly linked to food consumption habits. Specifically, we consider data about the prevalence of obesity and type-2 diabetes.   Nutrients and metabolic syndrome. According to the World Health Organization 49 , the three best dietary habits to prevent conditions associated with the metabolic syndrome are: i) limiting the intake of calories; ii) having a nutrient-diverse diet; and iii) favoring the consumption of fibers and proteins over sugars, carbohydrates, and fat. To verify whether the violation of these three recommendations is associated with an increased prevalence of metabolic disorders, we correlate diabetes and obesity prevalence with: i) the energy of the typical product (energy(a), Eq. (5)); ii) the entropy of energy intake from different nutrients (H energy-nutrients (a), Eq. (12)); iii) and the energy content of each individual nutrient (energy-nutrient i (a) Eq. (8)). Spearman rank correlations (R) between nutrients and prevalence of medical conditions across areas are summarized in Fig. 5.
The prevalence of obese and overweight children (Fig. 5, top row) is inversely correlated with the energy coming from fibers in the typical product, with correlations all lower than −0.5. We found a weaker inverse correlation R . We found a positive yet feeble association ≅ . R ( 01) with energy from carbohydrates.
Stronger correlations emerge for adults (Fig. 5, bottom row). Areas with higher prevalence of overweight and obese adults are those where the typical product is higher in energy ≅ . R ( 04) and is also high in fat, sugar, and carbohydrates ∈ . . R ( [0 35, 0 45]). Prevalence of adult obesity is negatively associated with nutrient entropy R ( 04) ≅ − . . The diabetes data yields the highest number of significant correlations (Fig. 5, bottom-right panel). High diabetes prevalence is found in areas where the food purchased is higher in energy coming from carbohydrates, sugar, and to a lesser extent fats R ( [0 35, 0 65]) ∈ .
. . Low diabetes prevalence is associated with high consumption of fibers and proteins R ( 05) ≅ . , especially in areas characterized by high entropy of nutrients = − . R ( 077). To build stronger evidence that the food descriptors provided in the dataset are directly linked to food-related illnesses-and not just proxies for confounding determinants of those ailments-we ran an ordinary least squares regression. The regression aims at predicting the prevalence of diabetes from the two highest-correlated factors (energy-carbs and H energy-nutrients ) and four control variables accounting for demographics and store penetration rates: average age, % of female residents, density of residents in the area, and number of transactions. Where necessary, predictor variables undergo a logarithmic transformation. We also applied a min-max rescaling of www.nature.com/scientificdata www.nature.com/scientificdata/ each variable, which allowed us to compare the relative importance of different factors in predicting the outcome. Results are summarized in Table 2. The model has a high goodness of fit (R 2 > 0.6) especially compared to a baseline model where only the control variables are considered (R 2 = 0.14). The magnitude of the regression coefficients indicates that all independent variables have predictive power, with the nutrient entropy being the strongest one.
A regression model that includes only energy-carbs and H energy-nutrients as independent variables retains a high explanatory power (R 2 = 0.56). We plotted the actual diabetes prevalence values against the values predicted by this 2-variable regression model in Fig. 6 (left). We checked how the standardized regression residuals-namely the absolute differences between the official diabetes estimates and the corresponding predicted values, divided by the standard deviation of all residuals-vary with area a's representa tiveness(a) (the ratio between the number of customers and the number of residents). To do that, we partition areas in equally-sized groups of areas with comparable values of representativeness, and average the standardized residuals within each group (Fig. 6 (right)). The prediction error is higher for areas in which the representativeness is lower (<0.1). Beyond that threshold, residuals quickly drop and stabilize.
The hierarchical arrangement of census areas (e.g., a borough contains a set of wards) calls for the use of hierarchical regression models. Specifically, we ran a Bayesian hierarchical regression 50 . In this type of regression, data is partitioned in groups; each group is modeled with its own set of regression coefficients expressed as probability distributions with normal priors. In turn, the parameters of such distributions are generated by higher-order probability distributions. In a Bayesian fashion, the distributions are learned from the data using Monte Carlo sampling methods 51 . We ran the hierarchical regression by grouping wards into their respective boroughs, and obtained a Bayesian R 2 52 of 0.78, suggesting that accounting for the hierarchical structures in the data increases the amount of explained variability.
Overall, the correlations and the regression results indicate that the aggregated food purchases are predictive of the illnesses related to food consumption, even after controlling for confounding factors, speaking to the ecological validity of our dataset.  Table 2. Linear regression to predict ward diabetes prevalence from carbohydrates (energy-carbs), nutrient entropy (H energy-nutrients ), and control variables. www.nature.com/scientificdata www.nature.com/scientificdata/ Usage Notes Data parsing. All files are provided in comma separated value format, with header on the first line and one record per line. This type of files is easily parsed with any programming language (e.g., R, Python) or spreadsheet utilities (e.g., OpenOffice Calc). Spatial definition. The spatial data ({aggregation}_grocery_info.csv files) can be integrated with other datasets using the geographical area identifier-the first column in every file-as join key. It is important to note that the boundaries of the geographical areas were periodically redefined in the past decades. In the dataset, we use the 2011 definition of LSOAs, MSOAs and Boroughs (the most recent one at the time of writing) and the 2015 definition of wards. The list of area ids over time and instructions on how to map data across aggregation levels are available on the official London Greater Authority atlas 43 . area filtering. In the Technical Validation Section, we showed that filtering out areas with lower coverage yields a sample that is more representative of the general population. The same type of filtering can be achieved by retaining only the entries with highest values of representativeness.

code availability
We provide the code we used to validate our data in the form of a python script.