Introduction

A substantial body of evidence points to the very early part of life being crucial in determining health in childhood and in later years. Exposures in childhood, such as parental social class, family income, parental employment, poor health, and adverse childhood experiences are associated with the risk of multimorbidity in adulthood, defined as the co-existence of two or more chronic conditions1,2,3,4,5. However, it is increasingly recognised that health is multidetermined and its risk factors are complex. Yet, many studies use a reductionist approach to isolate the influence of one specific factor on health outcomes rather than consider the combined effect of wider determinants from across the lifecourse6. This is in part because data limitations often preclude the analysis of multiple variables simultaneously, and because this multidetermined nature of health is challenging to characterise, map and explore. Additionally, authors sometimes explicitly reduce the number of variables chosen for analysis to reduce statistical complexity and computation that can result from highly correlated variables. Despite this, we know that risk factors for disease are interlinked and often cluster together7, so separating them into independent domains is challenging but should be considered. Therefore, researchers should examine clusters of risk, not just individual risk factors, which is the more traditional epidemiological approach.

Developing methods for capturing, mapping, and exploring how multimorbidity risk factors cluster in domains, as opposed to analysing individual exposures, can help develop more effective public health solutions. Identifying populations at higher risk and understanding how to address the clustering of vulnerabilities to future ill health requires approaches to data analyses that take account of the complexities of how risk is shaped and amplified across the lifecourse. In earlier work8, we conceptually identified 12 domains of early life risk factors as being important for multimorbidity risk. These domains were developed from a review of existing research evidence and policy documents that focussed on the early life determinants of multimorbidity, and through discussion with members of the public (through our public involvement work), who provided their thoughts on important early life determinants for future multimorbidity risk. This conceptualisation was built on the concept of lifecourse epidemiology, defined by Kuh and Ben-Shlomo9 as ‘the study of long-term effects on later health or disease risk of physical or social exposures during gestation, childhood, adolescence, young adulthood and later adult life’.

The 12 domains outlined in Fig. 1 were as follows. Domain 1: Prenatal, antenatal, neonatal and birth (from conception to the first month of life) focused on the period from preconception to the onset of labour, the circumstances and outcomes surrounding a birth, and the period immediately following birth. Domain 2: Adverse childhood experiences (ACEs) described negative experiences or events such as abuse, neglect, domestic violence, parental substance abuse, parental death, parental separation, and parental incarceration. Domain 3: Child health considered the health of a child from birth to age 18. Domain 4: Developmental attributes and behaviour (under the age of 18) focused on the developmental markers of children relating to cognition, coordination, personality types and behavioural traits, and included diagnosed neurodevelopmental conditions. Domain 5: Child education related to the process of learning and educational achievement, especially in educational settings, and the knowledge an individual gains from these educational institutions. Domain 6: Demographics referred to factors that described the size, structure, and distribution of populations. Domain 7: Transgenerational impact of parental health, behaviours and education referred to factors that can be transmitted across generations. Domain 8: Socioeconomic factors included factors concerned with the interaction of social and economic issues. Domain 9: The parental and family environment incorporated parental-child interactions and the interaction between children and the primary caregiver, parenting styles, parental beliefs, attitudes and discipline, and wider family factors such as kin networks. Domain 10: Neighbourhood, the physical environment and health care systems incorporated external factors relating to neighbourhoods and the physical environment. Domain 11: Health behaviours and diet incorporated common health behaviours and diet. Domain 12: Religion, spirituality and wider culture combined the role of religion, spirituality and wider cultural norms and attitudes in influencing health, health literacy, health behaviours and health care decisions. We conceptually identified these 12 domains of future multimorbidity risk a priori, before examining how they are represented within data. Therefore, our aim in this paper is to explore how these 12 conceptualised early life domains of future multimorbidity risk are characterised and represented within available data in the UK.

Fig. 1

The 12 domains of early life risk factors identified from the literature, policy and PPI contributions8.

This work is conducted as part of the Multidisciplinary Ecosystem to study Lifecourse Determinants and Prevention of Early-onset Burdensome Multimorbidity (MELD-B) project10, which aims to use an epidemiological and an artificial intelligence enhanced analysis of birth cohort and Electronic Healthcare data sources to identify lifecourse time periods and targets for the prevention of early-onset, burdensome multiple long-term conditions.

Methods

Data sources

To map the conceptualised domains to the available and relevant variables, it was important to consider data sources that captured a wide array of biological, social, environmental, and family variables from across childhood, and that followed up respondents across adulthood. Therefore, we focused on three UK longitudinal cohort studies. The Aberdeen Children of the 1950s (ACONF) includes children born in Aberdeen, Scotland, between 1950 and 1956; in total there are 12,150 cohort members, and participants were traced in their forties (2002) and linked to hospital and mental health admissions, maternity records, cancer registers, and death records11. The main ‘reading survey’ comprised reading and maths tests taken at school, and cohort members were linked to other school and birth records. A ‘family survey’ was administered randomly to 1 in 5 mothers of the cohort members regarding a range of topics, including the child’s medical history, the mother’s attitude towards their child’s education and their aspirations for them, leisure activities of the child and parent, and housing conditions and social background of the parents11. The National Child Development Study (NCDS)12 follows all children born in England, Scotland and Wales in one week in 1958, and includes 17,415 cohort members. To date, there have been 11 sweeps of data collection (4 in childhood and 7 in adulthood), and a biomedical sweep of data collection was conducted via a healthcare visitor at age 44. The 1970 British Cohort Study (BCS70)13 follows 17,096 cohort members born in England, Scotland, Wales, and Northern Ireland in one week in 1970. To date, there have been 10 sweeps of data collection (4 in childhood and 6 in adulthood), and a biomedical sweep of data collection was conducted via a healthcare visitor at age 46. All three data sources have collected information on social, economic, biological, and environmental factors at various time points in childhood.

Analyses

The first stage involved a manual data audit that mapped the 12 conceptualised domains to the available and relevant variables in childhood across the three data sources. Initially, we reviewed all variables recorded in the BCS70 and NCDS at ages 10 and 11, and ACONF at ages 7–12. Each variable was placed into the domain it best represented, based on discussion among the authorship team. We acknowledge overlaps between domains, and that some variables could have been included within more than one domain. However, we chose not to duplicate variables across domains, both to prevent multiple counting and because duplicated variables would cause problems in future analyses such as regression modelling. Instead, we included a variable in the domain the research team mutually felt it best represented.

Variables were excluded from the data audit if they were not relevant to any of the 12 domains. Given that the BCS70 and NCDS collected additional data at birth and at ages 5 and 16 (BCS70) or ages 7 and 16 (NCDS), we expanded the data landscape audit to include variables reported in these additional sweeps, to supplement domains that were either poorly represented or of poor data quality (such as high levels of missing data) at age 10 (BCS70) and age 11 (NCDS). A list of all the variables identified from the data landscape audit is included in Supplementary Tables 1–3.
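The one-domain-per-variable rule and the exclusion of irrelevant variables can be checked mechanically once an audit table exists. The following is a minimal illustrative sketch in Python; the audit itself was manual, and the variable names, sweep ages and the `build_domain_map` helper are hypothetical, chosen only to show the shape of such a check.

```python
from collections import Counter, defaultdict

# Hypothetical audit rows: (variable name, sweep age, domain number).
# A domain of None marks a variable judged not relevant to any of the 12 domains.
audit_rows = [
    ("birth_weight", 0, 1),
    ("maternal_age", 0, 1),
    ("reading_score", 10, 5),
    ("persons_per_room", 10, 8),
    ("interviewer_id", 10, None),  # administrative variable, excluded
]


def build_domain_map(rows):
    """Group variables by domain, enforcing that each variable is listed once."""
    counts = Counter(name for name, _, domain in rows if domain is not None)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"variables listed more than once: {duplicates}")
    domains = defaultdict(list)
    for name, _, domain in rows:
        if domain is not None:  # drop variables not relevant to any domain
            domains[domain].append(name)
    return dict(domains)


domain_map = build_domain_map(audit_rows)
for domain, variables in sorted(domain_map.items()):
    print(f"Domain {domain}: {len(variables)} variable(s): {variables}")
```

Raising an error on duplicates, rather than silently keeping the first entry, mirrors the decision to leave the final allocation to the judgement of the research team.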

After allocating all the variables identified from the data audit to a specific domain, we conducted an exploratory analysis to understand the relationship between the variables within a domain. This exploratory analysis included producing Pearson correlation coefficients and Principal Component Analysis (PCA).

Firstly, Pearson correlation coefficients were used for continuous variables to identify any highly correlated variables within a domain; this was an important step for reducing multicollinearity prior to any future regression modelling. We defined highly correlated variables as those with a correlation coefficient greater than 0.7 (ref. 14). Any correlated variables were initially flagged prior to the PCA. A strength of PCA is that it identifies highly correlated variables and transforms them into a smaller set of variables, called principal components. However, if highly correlated variables identified from the Pearson correlation coefficients remained after the PCA, we retained the flag in the data audit. This flag was retained to allow researchers to make their own decisions on which variables to retain prior to any further modelling; we suggest that these decisions should be based on data quality (i.e., level of missing data), the strength of the theoretical and/or conceptual reasoning for retaining a variable within a domain, and the specific research questions being addressed.
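The flagging step can be illustrated with a short Python sketch. It is not the Stata or SPSS code used for the analysis: the variable names and values are hypothetical, and applying the 0.7 threshold to the absolute value of the coefficient (so that strong negative correlations are also flagged) is an assumption.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_correlated(domain, threshold=0.7):
    """Return (name_a, name_b, r) for every pair of variables with |r| > threshold."""
    flagged = []
    for (name_a, xa), (name_b, xb) in combinations(domain.items(), 2):
        r = pearson(xa, xb)
        if abs(r) > threshold:  # assumption: threshold applied to |r|
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical continuous variables from one domain, four cohort members each.
education_domain = {
    "reading_score": [1, 2, 3, 4],
    "maths_score": [2, 4, 6, 8.5],
    "days_absent": [1, -1, 1, -1],
}
print(flag_correlated(education_domain))
```

Flagged pairs are reported rather than dropped, so the choice of which variable to retain stays with the researcher.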

Secondly, PCA was performed to help reduce the dimensionality of the data by categorising each of the 12 domains into mutually exclusive PCA components based on similar characteristics15. Individual PCA components were scaled to have mean zero and the data were automatically standardised to have unit variance; PCA components within each domain were identified if they had an eigenvalue above one16,17. If domains had multiple PCA components with an eigenvalue above one, we selected the top four components. Finally, we identified individual variables for a PCA component if they had a component score above 0.3 (refs 18, 19). We defined the dominant PCA component as the component that contributed the greatest proportion of the overall variance for a given domain within a single data source. Therefore, dominant indicates a combined variable (PCA component) that provides the highest variation within a single data source for a single domain. The PCA components within each domain were given a descriptive name chosen by the research team that summarised all the variables within a component with a component score above 0.3.

Given that PCA does not perform well on categorical data, we used categorical PCA (CATPCA), sometimes known as ‘nonlinear PCA’, as an extension of PCA to deal with both continuous and categorical variables20,21,22. CATPCA allows for the analysis of categorical variables by finding the optimal quantification of the categorical variables, which means assigning numerical values to the categories in a way that maximises the explained variance in the data. This method operates in the same way as other methods such as mixed component analysis and PCA, allowing the principal components to be interpreted in the same way as in standard PCA22,23. The steps for CATPCA are as follows20,21,22:

  1. The categorical variables are transformed into numeric values, typically using techniques like dummy coding or optimal scaling.

  2. The transformed numeric data is then subjected to the standard PCA algorithm to find the principal components that account for the maximum variance in the data.
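The two steps above, together with the retention rules (eigenvalue above one, at most four components, variables with a value above 0.3), can be sketched in Python with NumPy. This is an illustration under stated assumptions, not the SPSS CATPCA procedure: dummy coding stands in for optimal scaling, the ‘component score’ threshold is interpreted as a threshold on the absolute component loading, and the variables and the `dummy_code` and `domain_pca` helpers are hypothetical.

```python
import numpy as np


def dummy_code(values):
    """Step 1 (simplified): one 0/1 column per category level, dropping the first level."""
    levels = sorted(set(values))
    return np.array([[1.0 if v == level else 0.0 for level in levels[1:]] for v in values])


def domain_pca(x, min_eigenvalue=1.0, max_components=4, min_loading=0.3):
    """Step 2: standard PCA on standardised columns, with the retention rules applied."""
    # The correlation matrix is the covariance matrix of the standardised
    # (mean zero, unit variance) columns.
    corr = np.corrcoef(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    share = eigvals / eigvals.sum()  # proportion of variance per component
    # Assumption: "component score" is read as the loading (eigenvector * sqrt(eigenvalue)).
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    kept = [i for i, ev in enumerate(eigvals) if ev > min_eigenvalue][:max_components]
    members = {
        i: [j for j in range(x.shape[1]) if abs(loadings[j, i]) > min_loading]
        for i in kept
    }
    return eigvals, share, members


# Hypothetical socioeconomic-domain variables for eight cohort members.
persons_per_room = [0.5, 0.8, 1.0, 1.2, 1.5, 1.8, 2.0, 2.2]
shares_bedroom = [0, 0, 0, 1, 0, 1, 1, 1]
tenure = ["own", "own", "own", "rent", "rent", "own", "rent", "rent"]

x = np.column_stack([persons_per_room, shares_bedroom, dummy_code(tenure)])
eigvals, share, members = domain_pca(x)
print(f"dominant component explains {share[0]:.0%} of the variance")
print("column indices loading on it:", members[0])
```

The first retained component corresponds to what the text calls the dominant component, and `share[0]` is its contribution to the overall variance of the domain.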

Analysis was conducted using Stata 17.0 and SPSS 29, and figures were created using the online diagram software ‘drawio’24.

Results

Data landscape audit and domain data mapping

ACONF

After conducting the data landscape audit in the ACONF dataset, 74 variables were identified that represented 7 of the conceptualised domains (i.e., the demographic; socioeconomic; developmental attributes; childhood health; education and health literacy; antenatal, neonatal and birth; and neighbourhood domains) (Fig. 2). Several domains were represented by variables recorded in the family survey, a subset of the full ACONF sample (2208/12,150 cohort members). Therefore, for those domains where data were present in both surveys (i.e., the demographic, socioeconomic, and developmental attributes domains), we decided to include only data from the full ACONF sample. The only exception was the childhood health domain, which could only be captured in the family survey. The data audit did not find enough data to sufficiently represent the ACE domain; the religion, spirituality and wider culture domain; the parental-family factors domain; the transgenerational domain; or the health behaviours and diet domain.

Fig. 2

Data landscape audit mapping the available and relevant variables in ACONF to the conceptualised domains.

NCDS

As shown in Fig. 3, after conducting the data landscape audit in the NCDS dataset, 143 variables were identified that represented 10 of the conceptualised domains (i.e., the demographic; socioeconomic; developmental attributes; childhood health; transgenerational; antenatal, neonatal and birth; neighbourhood; health behaviours and diet; parental-family relations; and education and health literacy domains). Supplementary sweeps of data collection at birth, age 7 and age 16 were utilised to supplement the antenatal, neonatal and birth domain (birth sweep), the health behaviour and diet domain, and the transgenerational and neighbourhood domains (ages 7 and 16 sweeps), as these domains were not represented at age 11. The data audit in the NCDS dataset did not find enough data to accurately represent the ACE domain or the religion, spirituality and wider culture domain.

Fig. 3

Data landscape audit mapping the available and relevant variables in NCDS to the conceptualised domains.

BCS70

As shown in Fig. 4, after conducting the data landscape audit in the BCS70 dataset, 289 variables were identified that represented all 12 of the conceptualised domains. Supplementary sweeps of data collection at birth, age 5, and age 16 were utilised to supplement the antenatal, neonatal and birth domain (birth sweep), the health behaviour and diet domain, and the ACE domain (ages 5 and 16 sweeps), as these domains were not well represented at age 10 in the BCS70. We also supplemented the age 10 transgenerational domain and neighbourhood domain with data recorded at ages 5 and 16, given these domains had high levels of missing data at age 10.

Fig. 4

Data landscape audit mapping the available and relevant variables in BCS70 to the conceptualised domains.

Correlation coefficients and PCA analysis

The full results from the PCA are included in Supplementary Figs. S1–S3. The component that contributed the greatest proportion of the overall variance for each domain (i.e., component 1) is highlighted in Table 1. Utilising PCA to identify mutually exclusive groups reduced the dimensionality of the ACONF variables from 74 to 41. Dominant components, based on the greatest contribution to the overall variance within a domain, included a physical grade component that contributed 15% of the variance to domain 1 (prenatal, antenatal, neonatal and birth). This component included variables relating to maternal age, ‘mother physical grade’ (condition of mother at birth) and ‘child physical grade’ (condition of baby at birth). A behavioural component contributed 56% of the variance to domain 4 (developmental attributes) and included variables on ‘neurotic/anti-social rating’ and ‘total score scale b’ (a measure of child behaviour). An IQ and parental education component contributed 30% of the variance to domain 5 (education) and incorporated variables focusing on school mean IQ, mother’s further education and father’s further education. A family size component contributed 24% of the variance to domain 6 (demographic) and included variables relating to position of index child and family size. A housing component contributed 35% of the variance to domain 8 (socioeconomic) and incorporated variables relating to persons per room, housing tenure and housing area. An amenity in area component contributed 35% of the variance to domain 10 (neighbourhood, environment, and health care systems) and included variables relating to access to cold water, a bath and a shared WC, and whether a house is being rented from the council.

Table 1 The dominant component for each domain that contributed the greatest proportion of the overall variance (i.e., Component 1), for all three data sources.

Utilising PCA reduced the dimensionality of the NCDS data recorded at birth and ages 7, 11 and 16 from 143 to 73 variables. Dominant components, based on the greatest contribution to the overall variance within a domain, included a maternal fertility histories component that contributed 14% of the variance to domain 1 (prenatal, antenatal, neonatal and birth); variables within this component included maternal age, parity, and birth spacing. A ‘somatic’ and ‘infectious’ illness component contributed 7% of the variance to domain 3 (child health) and incorporated variables relating to ‘somatic symptoms’ and ‘infectious illnesses’ reported in childhood. A balance component contributed 25% of the variance to domain 4 (developmental attributes) and included variables assessing walking in a straight line, standing on left and right leg, and balancing heel to toe. An educational ability component contributed 59% of the variance to domain 5 (education) and incorporated variables focusing on general knowledge, number, book, oral, math and general ability tests, and reading comprehension. An ethnicity component contributed 19% of the variance to domain 6 (demographic) and included variables relating to language spoken in home, mother’s ethnicity, and father’s ethnicity.

Other dominant components in the NCDS included parental smoking that contributed 25% of the variance to domain 7 (transgenerational), and included variables related to parental smoking. A housing component contributed 20% of the variance to domain 8 (socioeconomic), and incorporated variables relating to household number, sharing a bedroom and number of persons per room. A parental-child interactions component contributed 18% of the variance to domain 9 (parental-family factors) and included variables relating to the mother and father going on walks with the child and father helping with managing a child. An access to green space component contributed 26% of the variance to domain 10 (neighbourhood, environment, and health care systems), and included variables relating to access to play areas, access to public parks and access to recreational grounds. Finally, an eating problems component contributed 24% of the variance to domain 11 (health behaviours) and included variables relating to any eating disorders and the type of eating disorders.

Utilising PCA in the BCS70, on data recorded at birth and ages 5, 10 and 16, reduced the dimensionality of the dataset from 289 to 149 variables. Dominant components, based on the greatest contribution to the overall variance within a domain, included a maternal fertility histories component that contributed 16% of the variance to domain 1 (prenatal, antenatal, neonatal and birth); variables within this component included maternal age, parity, and number of previous pregnancies. An unwanted sexual approaches (age 16) component contributed 11% of the variance to domain 2 (adverse childhood experiences) and included variables relating to the number of ‘unwanted sexual approaches in the last year’. A long-term illness component contributed 11% of the variance to domain 3 (child health) and included abnormal gastrointestinal findings, abnormal neurological findings, abnormal endocrine findings, and abnormal mental handicap findings. A coordination component contributed 18% of the variance to domain 4 (developmental attributes) and included variables focussing on hand coordination, being clumsy at games, difficulty picking up objects and difficulty kicking a ball. An educational ability component contributed 23% of the variance to domain 5 (education) and incorporated variables focusing on reading and math tests, estimated reading ages, difficulty reading and writing, and reading ability. An ethnicity component contributed 35% of the variance to domain 6 (demographic) and included variables concerning child’s ethnicity, mother’s ethnicity and father’s ethnicity.

Other dominant components in the BCS70 included a parental health behaviours component that contributed 8% of the variance to domain 7 (transgenerational), and included variables related to maternal smoking, father’s smoking, mother’s healthy lifestyle and father’s healthy lifestyle. A parental social class and finance component contributed 18% of the variance to domain 8 (socioeconomic), and incorporated variables relating to parental social class, van/car ownership, family income and living on a council estate. A parental–child interactions component contributed 10% of the variance to domain 9 (parental-family factors) and encompassed variables relating to the family doing activities together (walks/outings/meals/holiday/shopping/restaurants) and families chatting for at least 5 min per day. A neighbourhood description component contributed 21% of the variance to domain 10 (neighbourhood, environment, and health care systems), and included variables concerning noisy neighbourhood, teenagers on streets, ‘drunks’ on the street and rubbish on the street. A drug use component contributed 10% of the variance to domain 11 (health behaviours), and combined variables relating to drug use (heroin/semeron/cocaine/downers/uppers). Finally, a religion important component contributed 22% of the variance to domain 12 (religion, spirituality, and culture), and incorporated variables in relation to the religion a person was born into, time spent on religion, if religion was important and if religious views are misguided.

Discussion

In this paper we conducted a data landscape audit across three UK longitudinal cohort studies to explore how early life variables map against a conceptual framework of lifecourse determinants of multimorbidity. We categorised three cohort datasets into mutually exclusive components based on similar within domain characteristics. Seven domains were characterised by 74 variables in ACONF recorded when participants were aged 7–12, ten domains were characterised by 143 variables in the NCDS recorded at the birth of the participant or at ages 7, 11 and 16, and twelve domains were characterised by 289 variables in the BCS70 recorded at the birth of the participant or at ages 5, 10 and 16. PCA analysis reduced the dimensionality of ACONF variables from 74 to 41, from 143 to 73 in the NCDS, and from 289 to 149 in the BCS70.

The data audit successfully mapped all 12 conceptualised domains8 to the available and relevant variables across the three datasets. For some domains we have partial coverage across the data sources—religion, spirituality, and wider culture domain (BCS70), adverse childhood experience domain (BCS70), child health including check-ups and screening domain (NCDS and BCS70) and the parental family factors and parental ability to care for a child domain (NCDS and BCS70). The remaining domains are represented across all ages in childhood within all three data sources. We acknowledge that our decision to map variables to the domain the research team felt they best represented does introduce some researcher bias into our analysis. We encourage researchers to consider our mapping as a guide to inform their own research rather than a conclusive, fixed list of variables. We would encourage researchers to adapt the domain mapping (where appropriate) in relation to the specific research question being addressed and associated methodology. In addition, we acknowledge there may be further variables that were unavailable in our datasets, but available in alternative datasets, and we would encourage researchers to incorporate these additional variables into future research.

Dominant components, based on the greatest contribution to the overall variance within a domain, included maternal fertility histories within the prenatal, antenatal and birth domain (NCDS and BCS70); long-term or ‘somatic’ symptoms within the child health, including check-up and screening domain (NCDS and BCS70); and IQ or educational ability within the child education and health literacy domain (ACONF, NCDS and BCS70). Other dominant components included ethnicity within the demography domain (NCDS and BCS70), parental health behaviours within the transgenerational impact of parent health and health behaviours domain (NCDS and BCS70), housing within the socioeconomic domain (ACONF and NCDS) and parental-child interaction within the parental family factors and parental ability to care for a child domain (NCDS and BCS70). It is important to acknowledge that this analysis has identified dominant components within 12 early life domains that have previously been found to be conceptually linked to the risk of multimorbidity8. It was beyond the scope of this descriptive paper to explore the relationship between domains or dominant components and risk of multimorbidity; in the future research section we discuss next steps to explore these relationships.

Whilst we have demonstrated that there are some similarities in the dominant components across datasets, some differences in dominant components were to be expected. From a methodological point of view, the variables mapped to each domain were not identical across datasets. Additionally, some datasets had more data available than others, either because questions around specific domains were not asked or, in some instances, because the data collected were of poor quality. Conceptually, from a cohort perspective, we might also expect some differences. Firstly, there were differences in the geographic location of the cohort studies, with ACONF being located in Aberdeen, Scotland and both the NCDS and BCS70 located across England, Scotland and Wales. Secondly, although the three cohorts are only 14–20 years apart, the childhood conditions that the older cohorts (ACONF and NCDS) experienced were arguably different from those of the younger cohort (BCS70). The younger cohort (BCS70) experienced increased women’s employment and economic independence, decreased family stability, educational expansion and a shift away from the male breadwinner nuclear family that dominated the family structure of the NCDS and ACONF cohorts25,26,27,28,29. The younger cohort (BCS70) also experienced generational pay progression, higher real household disposable incomes and an increase in home ownership25, but globalisation and a shift of income from labour to capital meant greater economic uncertainty and increased inequality29. Therefore, these factors might lead us to expect more heterogeneity in the early life experiences of the 1970 cohort compared to the 1958 and 1950–1956 cohorts.

We know that single exposures in childhood are associated with the risk of multimorbidity in adulthood1,2,3,4,5. However, it is complex and difficult to explore the combined action of health determinants across childhood, both because data limitations often preclude the analysis of multiple determinants, and because such combinations are difficult to characterise, map and explore in terms of their components and timing. As a result, most studies have used a reductionist approach to isolate the influence of a specific factor on health outcomes rather than consider the multidetermined nature of combined factors6. We have demonstrated that the auditing, characterisation, and mapping of potential early life determinants of future health outcomes can be achieved if multiple large scale longitudinal studies are used. We have also demonstrated the potential strength of using multiple prospective studies, given that the data available allow researchers to consider the potential long-term effects of combined domains of risk relating to social, economic, and environmental exposures during early life. We note, however, that variables were not uniform across data sources.

Acting on these wider combinations of determinants from across childhood will not just improve health outcomes but have further intermediate benefits, such as narrowing social, environmental and health inequalities in childhood that may be on the pathway to health outcomes in adulthood. Conceptually, multiple combined risk factors might lead to greater risk of single disease outcomes such as cardiovascular disease30,31; however, developing methods for capturing the multidimensional nature of risk of multiple disease outcomes such as multimorbidity is a field of growing interest and research32,33. This mapping and characterisation of variables that relate to early life risk domains of multimorbidity can provide a step towards understanding and promoting population-level action to prevent or delay the burden of multimorbidity. This work also supports the drive within the UK healthcare system to shift towards a more preventive model of health. The Department of Health and Social Care34 policy paper on transforming the public health system highlighted the need to focus on prevention and the wider determinants of health, and the 2018 paper on the Public Health Priorities in Scotland35 included the need to invest early in young people’s future as the best form of prevention. We therefore see our research as an extension to these discussions and a contribution to this emerging field, as we have demonstrated that, through the effective utilisation and evaluation of birth cohort datasets, we were able to audit and map how domains for prevention of multiple long-term conditions can be represented within data.

Future research

Given we have conceptually8 and now methodologically and descriptively captured potential domains of early life risk factors of future multimorbidity risk, our next step involves quantifying the association between these domains, and/or potential interactions between domains, and the risk of developing multimorbidity within the MELD-B project10. For example, within the MELD-B project we have used retained variables identified in this paper, and prediction modelling methods, to predict the risk of obesity-hypertension comorbidity for each cohort member within five of the conceptualised domains36. Regression modelling was then used to explore the association between domain-specific predicted risk scores and risk of obesity-hypertension; in turn, this has helped to identify the most important early life domains for obesity-hypertension risk36. Given multimorbidity is a complex outcome to capture, we have considered obesity-hypertension comorbidity as an exemplar outcome, and other research may wish to explore the relationship between early life determinants and different multimorbidity clusters.

The data audit is also useful for other researchers who wish to use cohort data. As discussed, most previous research focuses on single exposure-outcome relationships, potentially to reduce statistical complexity or to focus policy attention on a specific aspect. However, children are likely to be exposed to combinations of risk factors across these 12 early life domains. Our research supports the need for disciplines including public health and epidemiology to move beyond single-exposure analysis and incorporate information from multiple domains into the same analysis. This could help researchers to understand how different experiences across a range of social, economic and environmental domains may influence the risk of developing long-term conditions, and provide actionable insights into how to support people to live more healthily for longer.

Our aim in this paper was to provide a descriptive analysis exploring how well the 12 individual childhood domains of future multimorbidity risk could be characterised and represented within available data in the UK. We therefore restricted each variable to a single domain. Future research could go beyond this aim to explore the relationship between variables across domains. Finally, we have demonstrated and documented the array of important early childhood variables within each domain, and the methods presented here could be adapted to address a range of other research questions and topics, not just those related to health.

Strengths and limitations

The three longitudinal cohort studies provide some of the richest and most in-depth data in the UK and allowed us to capture a wide array of biological, social, environmental, and family variables from across the whole of childhood. The same level of information would not have been available from electronic health care records in either primary or secondary care. Despite this, some of the conceptualised domains of multimorbidity risk outlined in our previous work8 could not be analysed across all three datasets: the religion, spirituality and wider culture domain (NCDS & ACONF); the ACE domain (NCDS & ACONF); the child health, including check-ups and screening, domain (ACONF); the parental family factors and parental ability to care for a child domain (ACONF); and the health behaviour and diet domain (ACONF). This was either because the data were of poor quality or because they were reported in only one of the cohorts. It is also important to note that variables were not identical across data sources, meaning direct comparisons should be interpreted with caution.

The three cohort studies are representative of children born in 1950–1956 (ACONF), 1958 (NCDS) and 1970 (BCS70); as such, these data sources largely lack ethnic diversity and do not reflect the population of the UK today. A further limitation of using historical data sources is that some of the variables should be interpreted in the historical context in which they were recorded. Newer surveys, such as the Millennium Cohort Study, may have improved variables that better represent some of the domains.

We acknowledge that our method of analysis is a simple and straightforward way to reduce the dimensionality of these data. More sophisticated methods, such as latent class analysis or clustering methods, could be used, and some of these will be explored in future work as part of the MELD-B project10. It is important to note that we define dominant combined variables (PCA components) as those that provide the most variation within a single data source for a single domain. It was beyond the scope of this paper to explore (a) whether the variables, components or domains we have identified are important predictors of future long-term conditions or (b) whether the component identified provides the most variation within a domain across the population as a whole or across multiple datasets, because this depends very much on which variables were measured and available to compare.
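To make the definition of a dominant combined variable concrete, the following minimal sketch extracts the first principal component for a single domain within a single data source and reports the share of variation it accounts for. It is illustrative only and is not the code used in this study: the function name `dominant_component`, the standardisation step, and the synthetic three-variable domain are assumptions.

```python
import numpy as np

def dominant_component(X):
    """Return the loadings and explained-variance share of the first principal component.

    X: 2-D array with rows = cohort members and columns = variables in ONE domain.
    """
    X = np.asarray(X, dtype=float)
    # Standardise so that variables on different scales contribute comparably.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardised data: rows of Vt are the component loadings.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return Vt[0], explained[0]

# Synthetic stand-in for one domain in one cohort (assumption, not real cohort data):
# two variables share a common latent factor, the third is unrelated noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=500)
X = np.column_stack([
    latent + 0.3 * rng.normal(size=500),
    latent + 0.3 * rng.normal(size=500),
    rng.normal(size=500),
])
loadings, share = dominant_component(X)
```

Because the component is derived from the variables available in one data source, the loadings and the explained share would change if a different set of variables had been measured, which is the caveat raised in point (b) above.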

Conclusion

We have demonstrated that, when multiple large-scale longitudinal studies are used, enough data are available for researchers to move beyond a reductionist approach that isolates the influence of a specific factor on health and to consider combined risk factors; we hope to demonstrate this in future research. Developing such approaches for capturing, mapping, and exploring the combined effects of early life risk factors for health, as opposed to individual exposures, can help to challenge the traditional epidemiological approach to the aetiology of disease and to develop new ideas and solutions for the prevention of ill health. Further research should build on the data audit to explore the relationship between the domains we have identified and the risk of developing multimorbidity.