Can big data solve a big problem? Reporting the obesity data landscape in line with the Foresight obesity system map

Background Obesity research at a population level is multifaceted and complex. This has been characterised in the UK by the Foresight obesity systems map, which identifies over 100 variables, across seven domain areas, that are thought to influence energy balance and subsequent obesity. Data with which to consider the whole obesity system have traditionally been lacking. However, in an era of big data, new possibilities are emerging. The first challenge is understanding what data are available; the second is that data are reported too inconsistently to enable adequate use in the obesity context. In this study we map data sources against the Foresight obesity system map domains and nodes and develop a framework to report big data for obesity research. Opportunities and challenges associated with this new data approach to whole systems obesity research are discussed.
Methods Expert opinion from the ESRC Strategic Network for Obesity was harnessed in order to develop a data source reporting framework for obesity research. The framework was then tested on a range of data sources. In order to assess availability of data sources relevant to obesity research, a data mapping exercise against the Foresight obesity systems map domains and nodes was carried out.
Results A reporting framework was developed to recommend the reporting of key information in line with these headings: Background; Elements; Exemplars; Content; Ownership; Aggregation; Sharing; Temporality (BEE-COAST). The new BEE-COAST framework was successfully applied to eight exemplar data sources from the UK. 80% coverage of the Foresight obesity systems map is possible using a wide range of big data sources. The remaining 20% were primarily biological measurements often captured by more traditional laboratory based research.
Conclusions Big data offer great potential across many domains of obesity research and need to be leveraged in conjunction with traditional data for societal benefit and health promotion.


Introduction
The basic drivers of obesity are simple (more energy consumed than expended); however, the aetiology is complex. It is now widely accepted that multiple factors, including physiological, social and environmental, act synergistically to drive obesity. These factors are often described as the 'obesogenic' environment (an environment that hinders sufficient physical activity and promotes excessive intake of food, thereby making obesity more likely). This may explain the limited success, at a population level, of traditional approaches to obesity prevention and management, which have tended to focus on behavioural, educational and pharmacological factors. For this reason, many researchers and policymakers are now advocating for 'whole systems' approaches to obesity prevention and management, which promote integrated systems to address obesity, rather than focusing on risk factors in isolation [1][2][3].
In 2007, the Foresight Report, the most comprehensive UK investigation into obesity and its causes [1], described obesity as a "complex web of societal and biological factors that have, in recent decades, exposed our inherent human vulnerability to weight gain". The investigation produced an obesity system map, with energy balance at its centre. Around this, over 100 variables are split into seven domains that are thought to directly or indirectly influence energy balance.
Knowing that individual (e.g., genetics, age, gender and ethnicity), social (e.g., income, education, area deprivation) and area factors (e.g., access to fast food, street connectivity) contribute towards obesity is useful [4][5][6] and has identified key areas to target for prevention and/or treatment. However, the key is understanding the interplay between these factors, and this understanding is currently lacking. Characterising the complex, nonlinear and unpredictable ways in which systems interact will offer insight into the development and evaluation of systems based approaches, moving away from siloed thinking [7]. Data required to fill gaps in traditional resources and to enable research using a whole systems approach are inherently difficult to collect, especially on a large scale. For this reason, new and emerging data sources are increasingly gaining attention.
Internationally, a broader understanding of risk factors for obesity and increasing awareness of the social determinants have led to recognition of the need for more comprehensive, cross sectoral strategies to tackle obesity [8].
The preceding paper in this series [9] reviewed how 'found' data sources, often referred to as 'big data', have been utilised in the literature to better understand obesity. Data on our activity, behaviour and location, from sources as diverse as smart motorways, social media, store loyalty cards and consumer organisations, have been shown to offer fruitful research opportunities, contributing in ways where traditionally sourced research data perhaps could not.
This second paper from the Economic and Social Research Council (ESRC) Strategic Network for Obesity builds on this description of how 'big' or 'found' data have been used to date, and considers the future potential of these data to enhance a 'whole systems' understanding of obesity. Identifying new types of data becoming available and mapping these onto the domains defined in the Foresight Obesity System Map should reveal the extent to which such data may be capable of addressing the whole system.
One challenge that potentially precludes the use of big data to their full potential is a lack of awareness and understanding around what data exist. Aspects of these new big data sources, such as the volume, variety, velocity and veracity are often challenging to conceptualise and capture.
It is therefore also crucial that ways are found to communicate the potential (and limitations) of new, as yet untapped, data sources, across disciplines and sectors, to facilitate the move towards a whole systems approach to obesity.
In this paper, we aimed to:
1. Develop a framework in which to effectively report big data for use under a whole systems obesity lens.
2. Use the new framework to report indicative exemplar data types in line with the Foresight obesity system map domain areas.
3. Identify big data sources for use in whole systems obesity research and map these against the Foresight obesity systems map.
4. Discuss key challenges associated with using new and large data sources to analyse obesity from a whole systems perspective.

Methods
This paper is one output from a collaborative network of academic researchers, industry partners, charity representatives and members from the public sector. We convened 40 members and hosted 5 network meetings between 2015 and 2017. During these meetings, members shared: experiences of using big data for obesity research, knowledge of suitable data sources, and expert opinion on how to optimise this wealth of data. For the purpose of this paper, big data were defined synonymously with 'non-traditional' data; in other words, any data not collected specifically for academic research purposes. Synthesis of expert opinion on optimising data culminated in the development of a reporting framework. The purpose of this framework was to outline a structure for reporting the features of big data in obesity research, although its application may be valid outside the obesity research area.
The reporting framework was then applied to eight exemplar data sets, to demonstrate its usefulness in communicating important data details. The Foresight obesity systems map was indexed with domain and node identifiers (presented in the supplementary material). We use indicative use cases to present the relevant Foresight nodes within the exemplar reporting.
In order to illustrate the potential scope and depth of big data, a list of potential data sets available in the UK was mapped against the nodes of the Foresight obesity systems map. The objective was not to perform a comprehensive audit, which would quickly become outdated, but rather to demonstrate the potential value and opportunity of big data as a resource in understanding the obesity system. The list of data sets was thus a convenience sample, based on data sets that were familiar to network members. The mapping exercise was supplemented with one more traditional and comprehensive dataset, the UK Biobank cohort, to highlight how different types of data might be used together.
For other application areas (i.e., not obesity) this exercise could be repeated with reference to another domain specific theoretical framework.

BEE-COAST framework
Members of the ESRC Strategic Network for Obesity agreed that, for all new data sources, it is essential to provide some Background (B) on the history and purpose of how and why the data are generated, including key features of the data. This is especially important when data are used in a context for which they were not initially collected. A detailed description of the Elements (E) of the data is required for others to fully understand their potential application. These Elements encompass detailed Content (C), Ownership (O), level of Aggregation (A), for example individual, neighbourhood, regional or national, conditions related to Sharing (S) and Temporality (T) of the data. Finally, these datasets should be illustrated using Exemplars (E) to include the format of the data and indicative use cases (Table 1).
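As an illustration only, the BEE-COAST headings can be thought of as a simple record structure. The sketch below is not part of the framework itself: the class name `BeeCoastReport`, the field names and the `missing_headings` helper are assumptions introduced here, and the example values paraphrase the YouGov fragments that appear later in this paper.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class BeeCoastReport:
    """Hypothetical record for one data source under the BEE-COAST headings."""
    name: str
    background: str = ""    # B: history and purpose of the data
    # Elements (E) are represented by the five fields below
    content: str = ""       # C: what the data contain
    ownership: str = ""     # O: who owns the data
    aggregation: str = ""   # A: individual, neighbourhood, regional or national
    sharing: str = ""       # S: conditions related to sharing
    temporality: str = ""   # T: time coverage and update frequency
    exemplars: List[str] = field(default_factory=list)  # E: formats and use cases

    def missing_headings(self) -> List[str]:
        """Return the headings that have not yet been filled in."""
        return [heading for heading, value in asdict(self).items() if not value]


# Illustrative, deliberately incomplete entry
report = BeeCoastReport(
    name="YouGov data",
    content="Product consumption, self-classified eating habits, concerns about food",
    sharing="Commissioned tables at a modest but commercial rate",
    temporality="Updated quarterly",
    exemplars=["Supermarket accessibility and electronic delivery of groceries"],
)
print(report.missing_headings())
```

A structure of this kind makes incomplete reporting visible: here the background, ownership and aggregation headings are flagged as still to be documented.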

Reporting data sources
Eight exemplar data sources were included in this review as providing valuable data for use in whole systems obesity research: (i) Ordnance Survey Points of Interest data; (ii) Food Standards Agency food hygiene data; (iii) supermarket loyalty card data; (iv) physical activity applications/wearables; (v) new technologies to record diet; (vi) Acxiom data; (vii) Cameo data from Callcredit and (viii) YouGov data. The features of these datasets are summarised in Table 2 in accordance with the BEE-COAST reporting framework. Collectively, the exemplar data sets were found to map to 56 Foresight nodes, covering all 7 Foresight domains.

Mapping data sources to Foresight domains and nodes
The list of data sources and how these map onto the Foresight nodes can be seen in Table 3. Overall, 86/108 of the Foresight nodes are covered by at least one big dataset. When traditional cohort data are also included (UK Biobank), this increases to 89/108 nodes. Table 4 highlights the areas in which big data are, to the best of our knowledge, not readily available to map against Foresight domains and nodes. We believe that information relating to many of these nodes would typically be generated by research studies, which often recruit, relatively speaking, a small number of participants. Whilst participant numbers may be small, the number of data points may be large. Some of the nodes would likely require qualitative research to capture relevant data.
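The coverage figures above follow from counting the distinct nodes reached by at least one data source. A minimal sketch of that calculation is given below; the function name `coverage`, the source names and the node identifiers in `toy_mapping` are hypothetical placeholders (the actual identifiers are in the supplementary material), and only the totals 86, 89 and 108 come from the results reported here.

```python
TOTAL_FORESIGHT_NODES = 108  # node total used in the mapping results


def coverage(source_to_nodes, total_nodes=TOTAL_FORESIGHT_NODES):
    """Return (number of distinct nodes covered, proportion of all nodes)."""
    covered = set()
    for nodes in source_to_nodes.values():
        covered |= set(nodes)  # a node counts once, however many sources map to it
    return len(covered), len(covered) / total_nodes


# Toy mapping with made-up identifiers, for illustration only
toy_mapping = {
    "source_a": {"1.1", "1.2"},
    "source_b": {"1.2", "2.1"},
}
print(coverage(toy_mapping, total_nodes=5))

# Percentages implied by the reported counts
big_data_only = round(100 * 86 / TOTAL_FORESIGHT_NODES, 1)
with_uk_biobank = round(100 * 89 / TOTAL_FORESIGHT_NODES, 1)
print(big_data_only, with_uk_biobank)
```

On the reported counts, this gives roughly 80% coverage from big data alone and about 82% once UK Biobank is included.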

Discussion
The aim of this paper was to explore the potential role of so-called 'big data' in a whole systems approach to obesity. By mapping a small but varied selection of emerging data types onto the Foresight obesity system map, it is apparent that big data span 80% of nodes, and therefore could prove important in providing the breadth and depth of physiological, social, and environmental information needed to simultaneously examine inter-related risk factors for obesity in different populations and across multiple levels. Through this mapping exercise we highlight the wide variety of data which could be better exploited alongside existing research or for new, interdisciplinary, obesity research questions.
Data which span the whole of the obesity system are difficult and time-consuming to collect, particularly on a large scale. Big data have been heralded as a potential solution to this problem, with such data being generated, largely passively, at an ever-increasing rate and across a range of contexts. This is the first time the potential for big data has been evaluated in a whole systems context. Our data audit has shown the potential value of big data within

History
Under UK law any business intending to conduct 'food operations' (including selling, cooking food, storing, handling, preparing or distributing food) must register their business with the environmental health department of their Local Authority (LA). This is then used by the environmental health team to conduct food hygiene inspections and enforce food law. The register is updated by a LA when a business registers its intention to conduct food operations, and businesses are removed when registered businesses inform a LA of their intention to terminate food operations. Data are also updated when environmental health officers conduct food hygiene inspections. The frequency of such inspection will depend on the initial food hygiene rating assigned to the business

Sharing
The CDRC has a licence for data in three key areas of mobility, retail and sustainability. The variables relevant to health include product consumption (e.g., meat, vegetables, alcohol, carbonated drinks, confectionery and snacks); eating habits (self-classified) and concerns about food (e.g., salt, sugar, fats, gluten). Commissioned tables can potentially be generated at a modest but commercial rate.

Temporality
Data are updated quarterly.

Exemplars
Indicative use cases: YouGov data have been regularly used [34]. Current work is considering the relationship between supermarket accessibility and electronic delivery of groceries, in which individual level choices are a useful feature.
Indicative use cases: Exploration of the Acxiom data in the context of household migration has been undertaken by Thomas (2014) [35]. Use of the data in the context of retail consumption in times of austerity and the "credit crunch" has been considered by Thompson (2013) [36] and Clarke (2015) [37]. These academic studies have explored and reweighted for skews and variable quality of the individual returns.

Whilst the broad coverage of big data across the obesity system map offers exciting possibilities for research, it is important to acknowledge that big data are not the complete solution to a whole systems approach. The remaining 20% of nodes on the Foresight obesity system map were not directly featured in our data mapping exercise, for example, genetic and physiological variables relating to appetite control, metabolic rate and predisposition to obesity. Many of these unmapped nodes represent data that are commonly collected in traditional research, and recent large-scale initiatives (e.g., UK Biobank [10,15] and other ongoing longitudinal cohort studies [11][12][13]) will continue to contribute important large-scale data. This suggests that big data should be used to supplement and enhance traditional datasets. Indeed, this paper does not advocate for the use of big data in place of traditional data, but rather to complement traditional data and of course be considered in the context of the research [14].
An important aim of this paper was to develop and demonstrate a framework (BEE-COAST) for reporting big data that describes emerging data through a whole systems obesity lens. The framework was shown to successfully summarise and communicate the important features of a number of data sources, including vital information about ownership and sharing, as well as content. We suggest that this framework be used to report big data sources used in research. We also propose that it could be used to develop a reference list of big data sources as a resource for future research, akin to published data resource profiles. Its application may also be valid outside the obesity research area.
While the BEE-COAST framework goes some way to elucidating the available 'big data' for obesity research, further data sources still need to be made available to increase coverage of the Foresight obesity system map. Increased multidisciplinary collaboration may facilitate this. For example, while our data audit did not highlight much data relating to food production, data are certainly being captured across this domain, for example by satellites, the instrumentation of farms, and data driven control within the manufacturing process. Such sources are not necessarily evident or accessible to a research community around obesity. Interdisciplinary networks such as the ESRC's Strategic Network for Obesity, and repositories such as the CDRC, provide a long-term opportunity to ameliorate this difficulty.
We have seen that variables as diverse as physical activity behaviours, built environment features, food consumption and choice are all richly captured by emerging sources of data. However, in spite of the existence of these data, we may not yet be in a position to utilise them to their full potential due to restrictions around data access and linkage. While individual-level data exist (e.g., relating to physical activity behaviours), at times these data cannot be released at an individualised level due to confidentiality and anonymity restrictions. Individual-level linkage is only possible with explicit consent. This has implications for data linkage, as data are often released (and thus must be linked) at a neighbourhood level, or larger, rather than at the individual level. Such area-level linkage is less than ideal, as potentially important within-neighbourhood variability is lost, and analyses are subject to bias (e.g., the ecological fallacy). Innovative approaches to data sharing and linkage are therefore needed. UK Biobank [15], for example, overcomes this problem by releasing individual-level addresses for linkage with other datasets prior to the release of the main cohort data. In this model, the researcher can link the addresses with any secondary datasets, and then return these linked data to UK Biobank to be joined with the cohort data. Finally, the cohort data and linked secondary data are released back to the researcher, with the participant addresses removed. This process was possible because participants explicitly consented to take part in the UK Biobank study, which includes health data linkage. It is important to note that we have not explored the finer details of how such data sources might be linked and harmonised for research purposes.
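The three-step linkage model described above (addresses released, secondary data attached by the researcher, addresses removed before final release) can be sketched schematically as follows. This is an illustration of the logic only, not UK Biobank's actual procedure or software: the function names, the `link_id` key, the variable names and all values are hypothetical.

```python
def release_addresses(cohort):
    """Step 1 (data holder): release only a linkage key and address per participant."""
    return [{"link_id": p["link_id"], "address": p["address"]} for p in cohort]


def link_secondary(addresses, secondary_by_address):
    """Step 2 (researcher): attach secondary variables to each released address."""
    return [{**row, **secondary_by_address.get(row["address"], {})} for row in addresses]


def join_and_strip(cohort, linked):
    """Step 3 (data holder): join linked variables to cohort data, then drop addresses."""
    linked_by_id = {row["link_id"]: row for row in linked}
    released = []
    for person in cohort:
        merged = {**person, **linked_by_id.get(person["link_id"], {})}
        merged.pop("address", None)  # addresses never reach the final release
        released.append(merged)
    return released


# Entirely made-up example data
cohort = [
    {"link_id": 1, "address": "1 Example Road", "bmi": 27.3},
    {"link_id": 2, "address": "2 Sample Street", "bmi": 23.1},
]
secondary_by_address = {
    "1 Example Road": {"fast_food_outlets_1km": 4},
    "2 Sample Street": {"fast_food_outlets_1km": 1},
}

linked = link_secondary(release_addresses(cohort), secondary_by_address)
released = join_and_strip(cohort, linked)
print(released)
```

The point of the design is that the researcher never holds addresses and cohort variables at the same time, while the final dataset still carries individual-level exposure measures rather than neighbourhood averages.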
While data linkage is often feasible and legal, whether it is ethical to perform such linkages is a wider consideration [16,17]. For example, users of fitness tracking devices may have consented within their terms and conditions to sharing of their data with trusted partners. However, could they reasonably have expected that these partners would combine these physical activity records with food purchase transactions and their health outcomes? In many cases the benefit to society from such research may be argued to outweigh the risk of identification of the individual, but does that mean we should link data in this way, and would it, or should it, be permitted by research ethics committees responsible for granting ethical approval for research? The role of the ethics committee is essential to protecting the interests of the public and the research community. Should ethical standards not be maintained there is a risk of public outcry, which could prohibit future research of this type. Worse still would be for individual-level information to leak outside the research communities, for example to insurance companies, who might penalise their customers.
In relation to data access, the ownership of data is another key issue. Supermarkets may be keen to share data with academic researchers if there is hope of serious insights into store planning or optimisation of marketing spend. Whether they are also keen to share data to understand negative health consequences from retail sales is a somewhat different proposition. Documenting the ownership of data sources is not always straightforward. For example, Food Standards Agency (FSA) data may be hosted and accessed via a local authority, but whether local authorities or the FSA are the data owner is debatable. This is likely the case for other data sources accessed via a third party.
These issues around access, linkage and ethics are echoed in the literature: most of the studies published to date that have attempted to utilise 'found' data describe challenges relating to these concerns [9]. It is encouraging, however, that solutions to these challenges have been found, as illustrated by the publication of such studies. Sharing best practice between research teams and organisations, relating to these challenges, presents an opportunity to progress with a new type of research more pragmatically and efficiently.
In this paper we have considered how the Foresight obesity system map might be more fully populated through extraction of big data sources. However, the transformative effects of big data are potentially much more wide-ranging. A primary example of this is in the field of Randomised Controlled Trials (RCTs), where there is a growing feeling that a combination of new datasets at scale, perhaps ranging from patient data, hospital outcomes and prescriptions to lifestyle, activity, eating and food purchases might be combined to create a massive population base for future trials. Such an approach could be cost efficient in targeting participants, it would allow substantial samples to be identified for even the rarest conditions, and potentially admit variations between focused sub-groups e.g., within a specific age range or ethnic category. Again, the ability of trials to utilise such data is largely dependent upon the availability and accessibility of individual level data.
Innovative approaches to research questions are required in changing political landscapes and big data present valuable possibilities. While RCTs are heralded as the gold standard in study design they are not applicable to all research questions. Many research questions relating to obesogenic factors (e.g., social and built environmental variables) are better suited to observational rather than experimental designs. This challenge is compounded in that big data are generally inherently observational rather than experimental. Thus we may need to look to alternative study designs combined with alternative and innovative methods of analysis. In particular, big data present valuable possibilities for natural experiments to compare the experience of similar groups under different environmental conditions or subject to different interventions e.g., in different regions [18][19][20][21]. Spatiotemporal patterns can be investigated at scale for the first time without the challenges associated with longitudinal cohort study design and follow-up.
The above notwithstanding, a plethora of approaches, methods, metrics and variables are already being used in studies, which makes cross-comparison difficult, even impossible, and so hinders the search for definitive evidence. It is also important to maintain scientific rigour, a critical perspective and humility, employing a priori hypotheses where relevant, or acknowledging hypothesis generation where this alternative is used. Current practices of reporting statistical significance are in urgent need of refreshment because very large sample sizes can render even trivially small effects highly significant, and thus reporting of effect sizes and clinical meaningfulness is essential. Heterogeneity in data collection methods and resulting biases must be considered and acknowledged. This may be further inflated through combining multiple data sources. The use of big data does not preclude the need for validation of findings, whether that is through use of data generated from a 'gold standard' or using better understood traditional data sources. Multiple big data sources, combined with traditional datasets, offer opportunities for cross-validation, which is especially important when findings result from hypothesis generation. Given the many strengths of big data, we may need to accept these limitations as a necessary compromise. However, newly developing machine learning methods, and new strategies for causal inference with observational data, may be part of a solution to these challenges [22].
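The point about statistical significance and sample size can be made concrete with a small worked example. The sketch below assumes a two-sample z-test with equal group sizes and unit variance, for which the test statistic is z = d * sqrt(n / 2), where d is the standardised mean difference and n is the number in each group. The function name and the chosen values of d and n are illustrative assumptions, not results from this study.

```python
import math


def two_sided_p_value(d, n_per_group):
    """Two-sided p-value for a standardised mean difference d.

    Assumes two equal groups of size n_per_group with known unit variance,
    so the standard error of the difference is sqrt(2 / n_per_group).
    """
    z = d * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))  # P(|Z| > z) for a standard normal


tiny_effect = 0.02  # a negligible difference of 0.02 standard deviations

p_small = two_sided_p_value(tiny_effect, n_per_group=100)
p_large = two_sided_p_value(tiny_effect, n_per_group=1_000_000)

print(f"n = 100 per group:       p = {p_small:.3f}")
print(f"n = 1,000,000 per group: p = {p_large:.3g}")
```

The effect size is identical in both cases; only the sample size changes. With 100 participants per group the difference is far from significant, whereas with a million per group it is overwhelmingly so, which is why effect sizes and clinical meaningfulness need to be reported alongside p-values.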
In conclusion, big data offer great potential across many domains of obesity research and need to be leveraged for societal benefit and health promotion. While obesity research and policy have evolved towards a 'whole systems' paradigm since the publication of the Foresight Report, they still tend to focus only on small parts of the obesity system in isolation, and fail to consider the interrelationships between different factors. Use of big data could facilitate understanding of the wider determinants of obesity and their interrelations across multiple levels. In turn, this would permit evidence-informed allocation of funds and ultimately optimise return on investment during a period of financial constraint. This is particularly timely in light of the Government's childhood obesity policy published this year, which, in spite of identifying 14 specific levers for change, found a best-case summary of existing evidence-base to be 'equivocal'.