Large-scale systematic analysis of exposure to multiple cancer risk factors and the associations between exposure patterns and cancer incidence

Exposures to cancer risk factors such as smoking and alcohol are not mutually independent. We aimed to identify risk factor exposure patterns and their associations with sociodemographic characteristics and cancer incidence. We considered 120,771 female and, separately, 100,891 male participants of the Australian prospective cohort 45 and Up Study. Factor analysis grouped 36 self-reported variables into 8 combined factors each for females (largely representing ‘smoking’, ‘alcohol’, ‘vigorous exercise’, ‘age at childbirth’, ‘Menopausal Hormone Therapy’, ‘parity and breastfeeding’, ‘standing/sitting’, ‘fruit and vegetables’) and males (largely representing ‘smoking’, ‘alcohol’, ‘vigorous exercise’, ‘urology and health’, ‘moderate exercise’, ‘standing/sitting’, ‘fruit and vegetables’, ‘meat and BMI’). Associations with cancer incidence were investigated using multivariable logistic regression (4–8 years follow-up: 6193 females, 8749 males diagnosed with cancer). After multiple-testing correction, we identified 10 associations between combined factors and cancer incidence for females and 6 for males, of which 14 represent well-known relationships (e.g. bowel cancer: females ‘smoking’ factor Odds Ratio (OR) 1.16 (95% Confidence Interval (CI) 1.08–1.25), males ‘smoking’ factor OR 1.15 (95% CI 1.07–1.23)), providing evidence for the validity of this approach. The catalogue of associations between exposure patterns, sociodemographic characteristics, and cancer incidence can help inform design of future studies and targeted prevention programmes.

Lifestyle factors such as smoking, alcohol intake, diet and physical activity play a major role in the aetiology of different cancers 1. However, exposures to these lifestyle factors are not independent of each other; for example, there are known links between raised Body Mass Index (BMI), lack of exercise, and poor diet, and thus it is unlikely that these exposures will have isolated effects on health 2. It is therefore important to establish the relationships between different risk factors, identify exposure patterns and their sociodemographic associations, and examine the joint associations of exposure patterns with cancer incidence, so that cancer risks can be better understood and addressed.
Factor analysis is a statistical approach that condenses multiple individual lifestyle risk variables into a smaller set of so-called "latent factors" (labelled "combined factors" in this paper) which capture variation in individual lifestyle risk variables. A number of previous cancer risk studies have applied factor analysis to diet and nutrition variables (e.g. [3][4][5][6]) and, separately, to reproductive variables (e.g. 7). Factor analysis is related to latent profile models, which have also been applied to lifestyle information (e.g. 8). The main difference is that latent profile models assume latent variables are categorical (e.g. present or absent) and correspondingly seek to divide individuals into discrete separate groups based on their lifestyle (e.g. "High risk" versus "Low risk"). By contrast, factor analysis considers continuous latent variables and returns a continuous score for each individual and each [...]
[...] varimax". This yielded the "combined factors". We found high agreement between the results from the discovery and validation datasets (Supplementary Note), and used the loadings from the discovery dataset in subsequent analyses.
Imputation of missing information. Missing data for cancer risk variables were imputed using a nonparametric random forest method, applying the function "missForest" in the R package "missForest" 14, with option variablewise = TRUE. Computation was parallelised by randomly splitting the discovery and validation datasets for males into 10 subsets each (9 subsets with 5000 individuals each, plus the remaining individuals in subset 10). For females, the discovery and validation datasets were analogously split into 12 subsets each. Information was then imputed within each subset. This procedure was repeated 10 times, to yield 10 fully imputed datasets. We checked that imputation of the missing data did not change the mean or range of any variables.
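The random split used to parallelise the imputation can be sketched as follows. This is a minimal illustration in Python rather than the authors' R code; the function name, the fixed seed, and the dataset size of 50,445 are hypothetical choices made for the example.

```python
import random

def split_for_imputation(ids, n_subsets=10, subset_size=5000, seed=0):
    """Randomly split ids into n_subsets groups: the first n_subsets - 1
    groups hold subset_size ids each; the last group holds everyone left."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # fixed seed only for reproducibility
    cut = (n_subsets - 1) * subset_size
    if len(shuffled) <= cut:
        raise ValueError("not enough individuals for the requested split")
    subsets = [shuffled[i:i + subset_size] for i in range(0, cut, subset_size)]
    subsets.append(shuffled[cut:])         # remainder goes into the last subset
    return subsets

# Hypothetical dataset of 50,445 participant ids (size chosen for illustration).
subsets = split_for_imputation(range(50_445))
sizes = [len(s) for s in subsets]
print(sizes)
```

Each subset would then be imputed on its own, and the whole procedure repeated to obtain the 10 imputed datasets described above.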
Calculation of combined factor scores. For each fully imputed dataset, we calculated factor scores for all individuals using the function "factor.scores" in the R package "psych" 15 with option method = "Thurstone". This method calculates the regression-based weights as $W = R^{-1}F$, where $R$ is the correlation matrix and $F$ is the factor loading matrix 16. The factor scores are then obtained as $S = ZW$, where $Z$ is the matrix of standardised observed variables. For each participant and each combined factor, the score was calculated as the mean of the scores from the 10 imputations (Supplementary Fig. S2).
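The regression-based (Thurstone) score calculation can be illustrated with a small, self-contained sketch. It uses plain Python instead of the "psych" package, and the correlation matrix, loadings, and standardised data below are invented toy values, not estimates from the study.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(a[i]) + list(b[i]) for i in range(n)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mean_scores(score_sets):
    """Element-wise mean of several score matrices (one per imputation)."""
    n = len(score_sets)
    return [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*score_sets)]

# Toy inputs (hypothetical): 3 observed variables, 1 factor, 2 individuals.
R = [[1.0, 0.5, 0.3],
     [0.5, 1.0, 0.2],
     [0.3, 0.2, 1.0]]            # correlation matrix
F = [[0.8], [0.6], [0.4]]        # factor loadings
Z = [[1.2, 0.7, -0.3],
     [-0.5, 0.1, 0.9]]           # standardised observed variables

W = solve(R, F)                  # W = R^{-1} F
S = matmul(Z, W)                 # S = Z W, one row of scores per individual
S_mean = mean_scores([S, S])     # averaging over (here identical) imputations
```

Solving the linear system instead of forming the inverse explicitly gives the same weights and is the usual numerical choice.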
There are different approaches for obtaining factor scores, each seeking to minimise a particular estimate of error. As a sensitivity analysis, we therefore also calculated scores using the method = "Anderson" option. This method calculates weights such that the factor scores are uncorrelated, as $W = U^{-2}F(F'U^{-2}RU^{-2}F)^{-1/2}$, where $R$ and $F$ are as defined above and $U$ is the diagonal matrix of uniquenesses 16. Based on the individual across-imputation mean scores, the correlations between the Thurstone and Anderson methods were extremely high (Pearson r 0.985-0.999), so scores based on the Thurstone method were used in subsequent analyses.
Study sample for association analyses. Cancer incidence data for 2006-2013 were obtained from linkage to the NSW Cancer Registry (Supplementary Note, Fig. 1), using corresponding ICD-10-AM topological codes for all invasive cancers (C00-C96, D45-47.1, 47.3-47.5), and for lung (C34), bowel (C18-C20), breast (C50), and prostate (C61) cancer, and melanoma (C43). The cancer incidence data included the month and year of diagnosis. To calculate the time between baseline questionnaire and cancer diagnosis, the day of diagnosis was set to 15. This resulted in 4-8 years of follow-up data (median 5.4 years, 25-75% range 5.3-5.9 years) for the 210,471 participants included in the association analysis for cancer incidence (see below and Fig. 1).
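Because only the month and year of diagnosis were available, the time to diagnosis depends on the convention of setting the day to 15. A minimal sketch of that calculation is given below; the function name, the example dates, and the use of 365.25 days per year are assumptions made for illustration.

```python
from datetime import date

def years_to_diagnosis(baseline, diag_year, diag_month):
    """Time from baseline questionnaire to diagnosis, in years.

    Only month and year of diagnosis are known, so the day is set to 15.
    Dividing by 365.25 is an illustrative convention, not taken from the paper.
    """
    diagnosis = date(diag_year, diag_month, 15)
    return (diagnosis - baseline).days / 365.25

# Hypothetical participant: baseline on 15 March 2007, diagnosed in March 2012.
follow_up = years_to_diagnosis(date(2007, 3, 15), 2012, 3)
print(round(follow_up, 2))
```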
Associations between combined factors and health, ancestry, and socioeconomic characteristics. We tested the association between each combined factor and age at baseline, as well as key health, ancestry, and socioeconomic characteristics (Supplementary Table S2). We used linear regression for each combined factor with all health, ancestry, and socioeconomic characteristics in a joint model. We defined significance at P < 0.001 to account for multiple testing (for sensitivity analyses, see Supplementary Note).
Associations between combined factors and cancer incidence. We tested the association between each combined factor and cancer incidence (separately for all cancers, and for lung, bowel, breast, and prostate cancers, and melanoma) using logistic regression. In each logistic regression analysis, cases were participants newly diagnosed with the relevant cancer after recruitment, while all other participants were included as non-cases. We applied the function "glm" in R with option family = "binomial" to estimate odds ratios (ORs) and the function "confint.default" to obtain 95% confidence intervals.
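The step from a fitted logistic regression coefficient to an odds ratio with a 95% confidence interval can be sketched as follows. In R, "confint.default" returns Wald intervals based on asymptotic normality; exponentiating the interval limits to obtain limits for the OR is assumed here, and the coefficient and standard error are hypothetical values rather than study estimates.

```python
import math

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution

def odds_ratio_wald_ci(beta, se, z=Z_975):
    """Return (OR, lower, upper) from a log-odds coefficient and its standard error."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical coefficient for a one-unit increase in a combined factor score.
or_, lower, upper = odds_ratio_wald_ci(beta=0.15, se=0.04)
print(f"OR {or_:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

The interval is symmetric around the estimate on the log-odds scale, which is why reported OR intervals look slightly asymmetric.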
The covariates included were age, BMI, private health insurance, remoteness of residence index (ARIA) 17, self-reported health rating, and number of GP visits in the 2 years prior to baseline (Supplementary Tables S2, S3). To capture GP visits, we used Medicare claims records and excluded 4759 female and 5778 male clients of the Australian Government's Department of Veterans' Affairs (DVA), as their healthcare is covered by a different billing system and may not be fully captured in the databases available for the 45 and Up Study cohort. DVA clients were identified through self-report in the 45 and Up Study baseline questionnaire, or through any mention of DVA coverage in a hospitalisation or emergency department presentation record. GP visits were identified using the MBS data (item codes 3-51). We also adjusted for self-reported pre-baseline cancer screening: mammographic screening for breast and all cancers for females, prostate-specific antigen (PSA) testing for prostate and all cancers for males, and bowel screening for bowel and all cancers for males and females. For analyses of melanoma risk, we further adjusted for skin colour, tannability, and average daily hours outdoors.

Table 1. Characteristics of the 45 and Up Study cohort at baseline, including age and all cancer risk variables used in the factor analysis. Column headings: Characteristic; Questionnaire item or definition. $ Missing post QC: missing values after exclusion of outliers (see Supplementary Table S1). $$ sd: standard deviation. ^ IQR: interquartile range (25%-75%).
We conducted two sensitivity analyses: (1) testing all combined factors jointly, and (2) excluding all individuals with cancer diagnosed in the first year after the individual's baseline questionnaire. Statistical significance was defined as P < 0.00125 in the main analysis (Bonferroni correction for 40 tests per gender), also requiring P < 0.05 in both sensitivity analyses.
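The significance rule combines a Bonferroni-corrected threshold for the main analysis with a nominal threshold for both sensitivity analyses. A minimal sketch of the rule is shown below; the function and argument names are hypothetical, and the P values in the example are invented.

```python
import math

ALPHA = 0.05
N_TESTS = 40                               # tests per gender, as stated in the text
BONFERRONI_THRESHOLD = ALPHA / N_TESTS     # 0.05 / 40 = 0.00125

def is_significant(p_main, p_joint, p_excl_first_year):
    """Main analysis must pass the Bonferroni threshold, and both
    sensitivity analyses must be nominally significant (P < 0.05)."""
    return (p_main < BONFERRONI_THRESHOLD
            and p_joint < ALPHA
            and p_excl_first_year < ALPHA)

# Invented P values for one factor-cancer pair.
print(is_significant(p_main=0.0005, p_joint=0.01, p_excl_first_year=0.03))
```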
We also verified that the estimates for the factor effects from logistic regression were not substantially different when additionally adjusted for highest educational qualification, income, and the relative socio-economic disadvantage index for areas (SEIFA, as calculated by the Australian Bureau of Statistics).
To further verify the results, we also carried out a survival analysis using competing risks regression for cancer incidence with death as the competing risk ("proportional sub-distribution hazards" regression model described by Fine and Gray 18). As with the logistic regression approach, we tested each combined factor separately. In a sensitivity analysis, we also tested all combined factors jointly. Significance was defined as P < 0.00125 in the main analysis, with a further requirement of P < 0.05 in the sensitivity analysis. These analyses were done using the function "crr" 18 in the R package "cmprsk", with 95% confidence intervals for estimates obtained using the function "summary.crr".
We note that competing risks regression has the advantage of explicitly taking into account follow-up time for individual participants, but the sub-distribution hazard includes individuals who have died in the risk set for cancer diagnosis 19. This can cause difficulties in interpretation; hence logistic regression was presented as the main analysis, and all results were verified using competing risks regression.
Tests for interaction. Exposures to different cancer risk factors can have synergistic effects on cancer risk, for example, as found for smoking and alcohol for cancers of the upper aerodigestive tract 20. Similar to comprehensive, non-hypothesis-driven assessments of individual risk factors, it is also of interest to examine potential interactions between pairs of risk factors to help identify areas for further investigation. However, large sample sizes are required for statistical interaction tests, and the multiple-testing correction required to systematically examine interactions can be prohibitive when examining many pairs of risk factors. Here, we leveraged the dimensionality reduction offered by the use of combined factors to test for interactions in a staged approach.
First, for cancer incidence, we tested interactions between combined factors using logistic regression as described above and including the interaction terms between pairs of combined factors. We only tested interactions between combined factors that were significantly associated with incidence of the same cancer type, and for that cancer type only (9 interactions for females, 3 for males; Supplementary Note).
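As a sketch of what such an interaction test estimates, the logistic model for two combined factors with a product term can be written as below. The notation ($f_1$, $f_2$ for the two factor scores, $x$ for the covariate vector, and the coefficient symbols) is introduced here for illustration and is not taken from the paper.

```latex
\[
\operatorname{logit}\,\Pr(Y = 1)
  = \beta_0 + \beta_1 f_1 + \beta_2 f_2 + \beta_{12}\, f_1 f_2 + \gamma^{\top} x
\]
\[
\text{interaction OR} = \exp(\beta_{12}), \qquad
\text{OR per unit of } f_1 \text{ at a given } f_2 = \exp(\beta_1 + \beta_{12} f_2)
\]
```

Under this parameterisation, $\beta_{12} = 0$ corresponds to joint effects that are multiplicative on the odds scale, so an interaction OR different from 1 indicates a departure from the multiplicative model.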
Second, to further investigate an interaction between 'alcohol' and 'menopausal hormone therapy (MHT)' combined factors, we also tested for interactions between each of the two original alcohol variables with each of the two original MHT variables, using the same approach as for the combined factors. When analysing the original variables, we carried out tests based on the original data with exclusion of missing values. We verified that similar results were obtained when using across-imputation means from missForest imputation of missing data. Finally, we carried out a stratified analysis of breast cancer risk by baseline MHT status (never/former/current use) for all females and, separately, for post-menopausal females. In each stratum, we separately tested associations between breast cancer incidence and the 'alcohol' combined factor and each of the two original alcohol variables.

Correlations between cancer risk variables. We calculated pairwise correlations between 33 variables for females and 28 variables for males (Supplementary Table S4).
The highest correlations were observed between variables in the same domain (e.g. smoking behaviour: years smoked and number of cigarettes per week). We also observed correlations between smoking behaviour and consumption of alcohol (positive), fruit (negative), and breakfast cereal (negative). While most of these correlations were relatively weak, some of them were almost as strong as correlations between related variables such as fruit and vegetable consumption, or red meat and processed meat consumption.
Most correlations were similar for females and males (see Supplementary Note for description of differences).

Identification of combined factors representing exposure patterns.
For females, factor analysis identified 8 "combined factors" that capture the variation in the original 33 variables and reflect exposure patterns. We labelled each combined factor based on the original risk variables with the strongest absolute loadings (Fig. 2a, Supplementary Table S5): 'smoking', 'alcohol', 'vigorous exercise', 'age at childbirth', 'Menopausal Hormone Therapy (MHT)', 'parity & breastfeeding', 'standing/sitting' (more time standing and less time sitting), and 'fruit & vegetables'. We refer to the combined factors by their label as e.g. 'smoking' factor. We note that while the labels reflect the strongest absolute loadings, each factor also captured some information from other variables. For example, the 'smoking' factors for both females and males also captured some information on alcohol and breakfast cereal consumption. Eight combined factors were also identified for males (Fig. 2b). Combined factors with the same label for females and males may have different loading contributions from the original risk variables, due to differences in strengths of correlations. For example, for males, there was a stronger correlation between red meat and alcohol consumption, and therefore a larger loading of red meat in the 'alcohol' combined factor (Fig. 2).
Associations between combined factors and health, ancestry, and socioeconomic characteristics. The associations between each of the combined factors and age, ancestry, health, participation in cancer screening, family history of cancer, and socioeconomic characteristics are shown in Fig. 3. We detected associations between the self-reported health rating and most of the combined factors, even when accounting for all other characteristics (i.e. age, ancestry, participation in cancer screening, family history of cancer, and socioeconomic characteristics). In particular, we identified associations between poorer health rating and higher 'smoking', 'urology & health' and 'meat & BMI' factor scores, as well as lower 'alcohol', 'vigorous exercise', 'moderate exercise', 'age at childbirth', 'standing/sitting', and 'fruit & vegetables' factor scores. We detected several associations with self-reported ancestry, including lower 'smoking' and 'alcohol' factor scores with Chinese ancestry; lower 'smoking' and higher 'fruit & vegetables' factor scores with Australian ancestry; higher 'alcohol' and, for females, higher 'smoking' factor scores with Irish ancestry; and lower 'alcohol' factor scores with Greek ancestry.
Interactions between combined factors associated with cancer risk. We found a possible interaction between 'age at childbirth' and 'MHT' factors for lung cancer incidence for females [adjusted odds ratio (OR) 1.17 (95% confidence interval (CI) 1.02-1.33); Supplementary Table S7]. However, we also found that smoking was higher among current than former and never MHT users at baseline (Supplementary Table S8). The interaction between 'age at childbirth' and 'MHT' factors was attenuated when also adjusting for the 'smoking' factor. Hence this interaction was not investigated further.
We also found a possible interaction effect between 'alcohol' and 'MHT' factors for breast cancer incidence [adjusted OR 1.06 (95% CI 1.00-1.12), p = 0.046; Supplementary Table S7]. To follow up this result and appropriately consider menopausal status, we focused on females post-menopause at baseline, and stratified them by never/former/current MHT use at baseline. The association with the 'alcohol' factor was strongest for current MHT users (Table 2), with similar results when using the original variables of weekly alcohol drinks and days drinking alcohol. Unfortunately, data on MHT type were not available to stratify the cohort further, and it is known that the association between MHT and breast cancer incidence varies substantially by MHT type 27. Moreover, current MHT users also reported higher alcohol intake, and the confidence intervals for odds ratios overlapped between strata, hence these results are interpreted as suggestive only.
Since previous studies reported interactions between MHT use and BMI [28][29][30], we examined the association between BMI and breast cancer risk stratified by MHT status (Table 2). BMI was associated with breast cancer incidence for never MHT users and former MHT users, but not for current MHT users at baseline, as also reported previously 28.

Discussion
We have systematically examined the pairwise correlations between 36 cancer risk variables for over 220,000 Australian residents, and identified 8 "combined factors" each for females and for males, which capture exposure patterns. We detected extensive associations between the combined factors and sociodemographic characteristics such as self-rated health, medical history, family history of cancer, participation in cancer screening, ancestry, private health insurance, income, education, area-based socio-economic disadvantage, and remoteness of residence. We also identified 16 significant associations between the combined factors and cancer incidence, of which 14 represent well-known relationships, providing evidence for the validity of this approach.
The comprehensive characterisation of correlations between over 30 cancer risk exposures (and thus their degree of co-dependency) in this study has a range of important applications, from studies of cancer risk, to microsimulation modelling and the design of interventions.
Correlation between cancer risk exposures can confound studies of cancer incidence, producing, for example, possibly spurious associations between smoking and breast cancer due to confounding by alcohol consumption 31. For future studies focused on specific single exposures, the correlations with other exposures provided in this study will allow better identification and examination of possible confounders. Similarly, the atlas of associations between combined factors and sociodemographic characteristics can also help to identify possible confounders for future studies of cancer risk.
Knowledge of relationships between risk factor exposures is also crucial for microsimulation modelling, which simulates millions of individuals in a population to forecast future disease burden and the effects of interventions. For cancer risk, current models typically only simulate an overall underlying cancer risk (e.g. [32][33][34]) or only one risk factor 35,36. The next step would be to create more holistic models with realistic constellations of multiple exposures, such as both smoking and alcohol intake for bowel cancer. This again requires information on correlations between these exposures, such as that provided by this study.
In another key area of application, information on relationships between risk factor exposures also underlies the development of comprehensive intervention programmes that help people modify their lifestyles. While targeting multiple, possibly uncorrelated behaviours simultaneously can reduce the completion rate of interventions 37, targeting correlated behaviours might improve success. For example, one study found that a joint intervention for smoking and alcohol intake temporarily reduced smoking better than an intervention for smoking alone 38, and that smoking lapses often occurred with alcohol use 39. Moreover, the atlas of associations between cancer-relevant risk behaviours and sociodemographic characteristics provides information for the design of targeted intervention approaches to include social determinants, suggesting which population groups have higher exposure to given risk factors. For example, we found that remoteness of residence was associated with both higher 'alcohol' and 'meat & BMI' combined factor scores for males, suggesting potential interventions to reduce alcohol intake, meat consumption, or obesity levels might be targeted to remote regions.

Figure. Adjusted odds ratio (OR; y-axis) for study participants depending on factor score (x-axis), with all other covariates held constant, and the individual at the 12.5th percentile of the score as reference (OR = 1). Odds ratios are adjusted for age, BMI, self-reported health at baseline, the number of GP visits in the 2 years prior to baseline, private health insurance, remoteness of residence, and where relevant, self-reported participation in cancer screening prior to baseline, or tannability-related covariates (see "Methods"). For all estimates and results from sensitivity analyses, see Supplementary Table S6.
In addition to dependencies between cancer risk factor exposures, it is possible that the effects of some exposures on cancer risk may not be independent. Very large sample sizes are necessary to reliably detect interactions, hence the results in this study are provided to generate hypotheses for testing in future work. We found a possible interaction between alcohol consumption and MHT status on breast cancer risk, with the highest risk for alcohol consumption for females taking MHT at recruitment (i.e. a departure from a multiplicative model). Alcohol is known to increase breast cancer risk for both pre- and post-menopausal females, with likely complex causal mechanisms 40. Previous meta-analyses have shown that alcohol consumption affects sex hormone levels including oestradiol 41, and the increase in circulating oestradiol levels with alcohol consumption is thought to affect the formation or growth of cancerous cells 42. Notably, a small double-blind, placebo-controlled crossover study found that alcohol consumption led to a threefold increase in circulating oestradiol for females taking MHT, with no significant change in those not taking MHT 43. However, residual confounding remains a possibility. Hence larger follow-up studies will be crucial to confirm whether an interaction effect is present, and if so, whether it relates to a specific MHT type.
Some of the associations identified between combined factors and cancer incidence can also serve to generate new hypotheses to be followed up in more targeted studies. As expected and noted above, almost all (14/16) of the most significant associations reflect well-known cancer risk factors (Supplementary Note). Of the nominally significant associations (0.00125 < P < 0.05), several reflect relationships that have also been reported previously, including associations between the 'vigorous exercise' factor and breast cancer 44 incidence for females and incidence of all cancers 45 for males (decreasing risks with higher scores), and between the 'alcohol' factor and bowel 46 and prostate cancer 47 incidence for males (increasing risks with higher scores). Some associations have contradictory evidence from past studies and should thus be considered as potential false positives due to chance or confounding. For example, some cohort studies have also reported increased melanoma incidence with MHT use (e.g. specifically for estrogens 48), although a small clinical trial did not find a significant effect 49. It is possible that the association depends on MHT type, data for which were not available in this study.
This study has several limitations. First, the 45 and Up Study participants were limited to those aged at least 45 years. While we did not see different correlations between original risk variables by 10-year age groups (data not shown), these correlations cannot necessarily be generalised to those below 45 years of age. The generalisability is also limited by sampling bias of participants, who are known to be healthier and of lower social disadvantage than the general population 9. Moreover, the correlations may be different among specific population subgroups (e.g. by social disadvantage, or cultural background); investigating this was beyond the remit of this study. We also note that the correlations between risk factor exposures and the associations between risk factor exposure patterns and sociodemographic characteristics may be different in other countries. However, representativeness is not required for reliable relative risk estimates from internal comparisons, e.g. when testing associations between combined factors and cancer incidence 50. Second, the data on cancer risk exposures and sociodemographic characteristics were self-reported, which could lead to biases due to participants' recall. While past work has shown that e.g. self-reported use of medications for chronic conditions agreed well with administrative data 51, this might not extend to lifestyle behaviours, especially exposures or characteristics that are possibly stigmatised. Moreover, for some risk behaviours, the question related to usual behaviour around the time of recruitment (e.g. "On how many days each week do you usually drink alcohol?"). Thus, information on cumulative lifetime risk exposure was only available for some of the risk factors. Third, this study was limited to available data; for example, it is known that cancer risk differs by MHT type 52, but this information was not available. Fourth, we used the number of GP visits in the 2 years prior to baseline as a covariate in the analyses of cancer risk. As data to capture GP visits were only available from June 2004, this variable would not be captured correctly for the approximately 14% of participants who were recruited prior to June 2006. However, the second covariate used for health at recruitment (self-rated health) was captured for everyone. Finally, while it would be of interest to identify the exact contributions of the original exposure variables to the associations between combined factors and cancer incidence, these in-depth follow-up analyses are beyond the scope of the current study.

Table 2. Association between alcohol and breast cancer incidence, stratified by MHT use, with a focus on post-menopausal females to adequately reflect dependencies between MHT use and menopausal status. CI: confidence interval. a Adjusted for age, Body Mass Index (BMI), self-reported health at baseline, the number of GP visits in the 2 years prior to baseline, self-reported participation in breast screening prior to baseline, private health insurance, remoteness of residence. b OR = odds ratio (per 1 unit change in the continuous variable). *P < 0.05, **P < 0.01.
In summary, this study provides a large-scale, systematic analysis of cancer risk exposures in a large population cohort. The identified relationships between risk variables can be used to inform a wide variety of future studies, and to design interventions targeting multiple correlated behaviours. Further information for targeting such approaches is provided by the associations between combined factors and sociodemographic characteristics. This study also shows the potential of factor analysis as an approach for identifying associations between exposure patterns and cancer risk.