Identification of phenomic data in the pathogenesis of cancers of the gastrointestinal (GI) tract in the UK Biobank

Gastrointestinal (GI) cancers account for a significant share of cancer incidence and mortality globally. A phenomic data approach allows researchers to reveal the mechanisms and molecular pathogenesis of these conditions. We aimed to investigate the association between phenomic features and GI cancers in a large cohort study. We included 502,369 subjects aged 37–73 years in the UK Biobank recruited since 2006, followed until the date of the first cancer diagnosis, date of death, or the end of follow-up on December 31st, 2016, whichever occurred first. Socio-demographic factors, blood chemistry, anthropometric measurements and lifestyle factors of participants collected at baseline assessment were analysed. Univariable and multivariable logistic regression were conducted to determine the significant risk factors for the outcomes of interest, based on the odds ratio (OR) and 95% confidence interval (CI). The analysis included a total of 441,141 participants, of whom 7952 (1.8%) were incident GI cancer cases and 433,189 were healthy controls. One marker, cystatin C, was associated with total GI cancer (adjusted OR 2.43; 95% CI 2.23–2.64) and with each individual GI cancer. In this cohort, White participants appeared to have a higher risk of developing GI cancers than Asian participants. Several other factors were associated with distinct GI cancers. Cystatin C and race appear to be important features in GI cancers, suggesting some overlap in the molecular pathogenesis of GI cancers. Given the small proportion of Asians within the UK Biobank, the association between race and GI cancers requires further confirmation.

knowledge of carcinogenesis, especially the mechanisms underlying the relationships among phenome, genome and environmental impact.
Following the completion of the Human Genome Project, comprehensive explorations into the human phenome have become crucial, forming a foundational framework for deciphering the intricacies of human health codes, especially the complex relationships among the phenome, genome and environmental influences [7]. Previous research has documented the use of phenomic data to investigate the correlation between environmental factors, such as diet and lifestyle, and the development of various cancers [8–12]. Furthermore, different toolkits associated with cancer phenomics were examined [13–16]. In the realm of cancer research, there is currently a limited body of knowledge regarding the exploration of diverse cancer types using phenomics data sourced from extensive datasets that encompass comprehensive clinical information. This scarcity is particularly evident when considering GI cancer, where the understanding of the complex interplay between phenotypic characteristics and the underlying molecular mechanisms remains relatively under-explored. Given this research gap, the UK Biobank emerges as an unparalleled resource, presenting an extraordinary opportunity to address the gap in our understanding of the involvement of phenomic data in GI cancer research.
The UK Biobank is a prospective cohort study of considerable magnitude that has enlisted more than 500,000 individuals between the ages of 40 and 69 years from various regions of the UK during the period of 2006 to 2010. The extensive sample size and comprehensive collection of phenotypic and genotypic data enable the examination of intricate associations between socio-demographic factors, blood chemistry, anthropometric measurements, and lifestyle of participants, thereby facilitating the development of more efficacious prevention and treatment approaches [17,18]. Within the national cancer registry, UK Biobank participants have contributed to a large accumulation of data comprising over 43,000 newly reported cancer cases up to the present. The UK Biobank is uniquely equipped to facilitate research into the factors that contribute to the onset of disease. It facilitates the identification of risk factors that increase or decrease the likelihood of developing specific diseases, as well as the precise quantification of the magnitude of these associations. In addition, the substantial diversity observed in the intensity of these associations across various demographic, socioeconomic, and lifestyle characteristics provides an opportunity to assess the applicability of these associations to substantial subgroups of the population [19,20].
Growing evidence highlights the importance and pressing need for utilizing phenomics in the examination of diseases [21–24]. Given the limited research on GI phenomics, this study was undertaken to explore the correlation between GI cancers and phenomic characteristics within the UK Biobank cohort. The UK Biobank offers a comprehensive range of socio-demographic, anthropometric, and biological markers, including blood and urine biomarkers, making it a valuable resource for this investigation. It is anticipated that the outcomes of this study will contribute to a more profound understanding of the multi-omics composition of patients, complemented by clinical data. This understanding, in turn, is expected to facilitate the identification of diagnostic, prognostic, and predictive biomarkers. Furthermore, the insights gained from this research endeavor hold the potential to unveil effective pathways for the personalized treatment of a diverse range of targeted diseases.

Study design and participants
The UK Biobank is a prospective cohort study with the aim of investigating how various diseases are caused by genetic, environmental, and lifestyle factors. Every participant in the UK Biobank provided informed consent upon enrolment, granting permission for the sharing of anonymized data with authorized researchers. Participants retained the right to withdraw their consent for data sharing at any point during their participation. All participants were registered with the National Health Service (NHS) in the United Kingdom. Participants completed a self-administered touchscreen questionnaire regarding their sociodemographics, lifestyle behaviours, medical history, and medication use during the initial recruitment session. They also underwent physical measurements such as weight, height, waist circumference, and hip circumference. Detailed information on the UK Biobank has been reported previously [25]. A total of 502,369 subjects aged 37–73 years were recruited between 2006 and 2010 [19] and followed up until the date of the first cancer diagnosis, date of death, or the end of follow-up on December 31st, 2016, whichever occurred first. Access to the UK Biobank data was applied for and approved (Application number 96759). This study was also approved by the Malaysia Medical Research and Ethics Committee (NMRR ID-23-00931-SPO).
Figure 1 illustrates the flow diagram for exclusion and inclusion in this study. After taking into consideration the exclusion criteria, we included 433,189 controls for our analyses. Controls were participants who did not have a record of ever being diagnosed with cancer according to the 10th Revision of the International Classification of Diseases (ICD-10). As for the cases, we included incident cancer cases who had GI cancers as coded using the ICD-10. GI cancers referred to in this study included C15 oesophageal cancer, C16 gastric cancer, C17 small intestine cancer, C18 colon cancer, C19 rectosigmoid junction cancer, C20 cancer of rectum, C21 cancer of anus and anal canal, C22 liver cancer, C23 gallbladder cancer, C24 cancer of other and unspecified parts of biliary tract, C25 pancreatic cancer and C26 cancer of other digestive organs. We excluded participants with any GI cancers diagnosed within two years from recruitment (n = 55,340) to account for reverse causation, and those with a missing date of cancer diagnosis (n = 7888). Finally, we included 7952 participants who had GI tract cancers as coded using the ICD-10.

Phenomic analysis
Sociodemographic characteristics (gender, age, and race) and lifestyle factors (smoking, alcohol drinking and physical activity) were collected during baseline assessment. Smoking status and alcohol drinking status were categorized as Never, Previous or Current smoker or alcohol drinker respectively, as recorded in the UK Biobank and reported previously [26,27]. The Townsend deprivation index score, which reflects socioeconomic status, was calculated for each participant based on the postcode of residence. Age was calculated from the date of birth and the date of the baseline assessment visit. Physical activity was recorded as the number of days on which the participants had moderate or vigorous activity for at least 10 min. As part of the research interest is to investigate the differences in cancer occurrence between races, we also explored race categorized as White (British, Irish, White and any other White background) versus Asian (Chinese, Indian, Pakistani, Bangladeshi and any other Asian background).
Anthropometric measurements including height, body weight, waist circumference and hip circumference were taken by trained nurses during the baseline assessment visit [28]. Body mass index (BMI) was calculated as weight divided by height squared (kg/m²). We then categorized the BMI based on the WHO BMI classification [29].
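As an illustration of the BMI calculation and categorization described above, the following is a minimal sketch. The function names are hypothetical, and the cut-off values are the commonly used WHO adult thresholds (18.5, 25 and 30 kg/m²), which are assumed here because the text cites the classification without listing them.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Return body mass index in kg/m^2 (weight divided by height squared)."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def who_bmi_category(bmi_value: float) -> str:
    """Map a BMI value to a category.

    Assumption: standard WHO adult cut-offs; the paper does not state
    the exact boundaries it applied.
    """
    if bmi_value < 18.5:
        return "underweight"
    if bmi_value < 25.0:
        return "normal weight"
    if bmi_value < 30.0:
        return "overweight"
    return "obese"
```

For example, a participant weighing 70 kg with a height of 1.75 m would fall into the normal weight category under these assumed cut-offs.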
Blood and/or urine samples were available for all the participants. Participants without any of the blood and urine biomarkers available were not included in this study. However, there were missing data for the variables included in this study. No imputation was performed for the missing data. For the analysis of each individual parameter, participants with missing data were excluded. Due to the large number of variables involved, information on the missing data is available upon request.

Statistical analyses
This study aimed to determine the risk factors for GI cancers. The risk factors explored consisted of 59 parameters as mentioned previously. The outcome was defined as the diagnosis of incident total GI cancers (based on ICD-10 C15-C26) and each individual diagnosis of GI cancer. Descriptive statistics were used to describe the characteristics of these variables for each individual GI cancer, total GI cancers and healthy controls (Table 1).
We employed univariable and multivariable logistic regression analyses of phenomic features against outcomes of interest (first/initial diagnoses of disease) in this study, similar to studies conducted by Gausman et al. [32] and Kang et al. [33]. Initially, univariable analysis using logistic regression was applied to determine the significant risk factors for the outcomes of interest. The Benjamini-Hochberg (BH) correction was implemented to control the False Discovery Rate (FDR) and mitigate the risk of false positives. The BH correction is a widely accepted method for controlling the FDR, offering a more balanced approach than the Bonferroni correction. Unlike the Bonferroni correction, which is known for its conservative nature and increased likelihood of false negatives, the BH procedure controls the FDR instead of the Family-Wise Error Rate (FWER) and thereby provides a good balance between identifying true positives and limiting false positives. Since the cohort analysed was large, besides relying on the p-value, odds ratio (OR) thresholds of more than 2.0 for categorical variables and more than 1.5 for numerical variables were set to screen for significant and important risk factors [34]. Univariable logistic regression results for all variables can be found in Supplementary Table 1. All the significant variables in the univariable logistic regression for each GI cancer and total GI cancers are listed in Table 2. Next, multivariable logistic regression based on the variables identified in Table 2 was conducted.
were also associated with some GI cancers (oesophageal cancer for both factors and liver cancer for smoking status). Anthropometric measurement (body mass index classification) was also associated with oesophageal, gastric and liver cancers.
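The univariable screening step described in the statistical analyses (Benjamini-Hochberg adjustment followed by an OR threshold) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the significance level of 0.05 is assumed because the text does not state it, and the rule is applied literally to ORs above the threshold because the text does not say how protective associations (OR below 1) were screened.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def passes_screen(odds_ratio, adjusted_p, is_categorical, alpha=0.05):
    """Apply the screening rule: FDR-adjusted significance plus an OR threshold.

    Assumptions: alpha = 0.05 (not stated in the text); thresholds of 2.0
    (categorical) and 1.5 (numerical) are taken from the text and applied
    only to ORs above 1.
    """
    threshold = 2.0 if is_categorical else 1.5
    return adjusted_p < alpha and odds_ratio > threshold
```

Compared with a Bonferroni adjustment, which multiplies every p-value by the number of tests, the BH adjustment scales each p-value by the number of tests divided by its rank, so smaller p-values are penalized less heavily.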
In terms of biochemical markers, several cardiovascular health markers (apolipoproteins A and B, HDL cholesterol, LDL cholesterol), in addition to ionized calcium and phosphate, were associated with some GI cancers. Hematological markers, including basophil, eosinophil, erythrocyte and monocyte counts, were also found to be associated with some GI cancers.
The variables that were significantly associated with GI cancers in the univariable logistic regression were further analysed in multivariable logistic regression (Table 3). For total GI cancers, an increase of 1 mg/L in cystatin C multiplied the odds of GI cancer by 2.43. The analysis also found that, compared to Asians, participants of White race had 2.22 times the odds of GI cancer.
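To make the per-unit interpretation of the odds ratio concrete, the sketch below converts an OR reported per 1-unit increase into the OR for other increments. It assumes the usual logistic regression model, in which the log-odds are linear in the predictor; the function names are hypothetical, and the value 2.43 is the adjusted OR for cystatin C reported above.

```python
import math


def log_odds_coefficient(odds_ratio_per_unit: float) -> float:
    """Return the logistic regression coefficient implied by a per-unit OR."""
    return math.log(odds_ratio_per_unit)


def odds_ratio_for_increment(odds_ratio_per_unit: float, increment: float) -> float:
    """Return the OR for a change of `increment` units.

    Assumption: the effect is log-linear, so the OR for k units is OR ** k.
    """
    return math.exp(log_odds_coefficient(odds_ratio_per_unit) * increment)


# OR per 1 mg/L of cystatin C, as reported for total GI cancers.
or_per_mg_l = 2.43
or_half_unit = odds_ratio_for_increment(or_per_mg_l, 0.5)  # 0.5 mg/L increase
or_two_units = odds_ratio_for_increment(or_per_mg_l, 2.0)  # 2 mg/L increase
```

Under this assumption, a 0.5 mg/L increase corresponds to the square root of 2.43 and a 2 mg/L increase to 2.43 squared. Note that these are ratios of odds, which approximate ratios of risk only when the outcome is rare.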
When we looked at each individual GI cancer, cystatin C remained significantly associated with the cancers, with an adjusted odds ratio of at least 1.97. Higher cystatin C and White race were associated with a higher risk of colorectal cancer (adjusted OR of 2.11 and 2.54 respectively), which was the most common GI cancer in the UK Biobank cohort. Pancreatic cancer was associated with higher cystatin C (adjusted OR 2.15), higher eosinophil count (adjusted OR 1.41) and White ancestry (adjusted OR 2.51 compared to Asians). For oesophageal cancer, besides higher cystatin C and White race, lower apolipoprotein A1, higher monocyte count and lower ionized serum calcium were associated with a higher risk. Compared to normal weight participants, those who were underweight (adjusted OR 2.75), overweight (adjusted OR 1.44) and obese (adjusted OR 1.84) had a higher risk of oesophageal cancer. Previous and current smokers were also found to have a higher risk (adjusted OR 1.82 and 2.77 respectively) of oesophageal cancer. On the other hand, gastric cancer was associated with male sex, a BMI outside the normal weight range, lower HDL cholesterol and higher monocyte count. Lastly, in addition to cystatin C, apolipoprotein B, phosphate, monocyte count, male sex and a history of smoking were associated with liver cancer.

Discussion
In this large UK Biobank cohort study of a total of 7952 incident GI cancer cases, we aimed to investigate the associations between the phenomic features and GI cancers to better understand the molecular pathogenesis of GI tract cancers. The analysis included a total of 441,141 participants in the study, of whom 7952 (1.8%) were incident cases of GI cancer and 433,189 were healthy controls. The results demonstrated significant associations between certain variables and different types of GI cancers, providing valuable insights into the risk factors and potential biomarkers associated with these cancers. The characteristics of the GI cancer group were substantially different from those of the control group, with the GI cancer group being older, predominantly male, having a higher BMI, and containing a greater proportion of current or former smokers.
The distribution of the top five GI cancers observed in this UK Biobank cohort was found to be consistent with global trends [3,36]. Colorectal cancer (47.01% of the total GI cancer cases) emerged as the most common GI cancer, followed by pancreatic cancer (11.12%), oesophageal cancer (9.91%), gastric cancer (7.59%), and liver cancer (6.50%). This pattern aligns with previous studies and reflects the epidemiology of GI cancers on a global scale, indicating the generalizability of the findings from this cohort to broader populations. As the top five GI cancers in the UK Biobank cohort represented 82% of all GI cancer cases, the subset analysis focused on these five of the nine major GI cancer categories.
One of the notable findings in this study was the consistent association of cystatin C and race with each type of GI cancer. Cystatin C, a biomarker related to kidney function [37,38], was found to be consistently raised and associated with all GI cancers in this cohort. Participants with higher cystatin C levels exhibited an increased risk of developing GI cancers, suggesting its potential as a prognostic biomarker. This finding is corroborated by previous literature [39–42]. Cystatin C exerts a series of complex effects that may result in either an inhibition or a promotion of tumour cell growth and dissemination, as demonstrated by previous research [39,40]. A recent study discovered a novel mechanism, in which mast cells induce endoplasmic reticulum stress and cystatin C mediates tumour inhibition during colorectal cancer development [41]. This function of cystatin C in cancer cells had not been reported before and may lead researchers one step closer to understanding the molecular pathogenesis of GI cancers in relation to cystatin C.
Additionally, race was found to be significantly associated with total GI cancers, with Whites having a higher risk compared to Asians. The influence of race was also evident in the colorectal, pancreatic and oesophageal cancer subsets in this study. Epidemiological studies have examined the association between race, specifically White and Asian populations, and gastrointestinal malignancies, including colorectal, pancreatic, oesophageal, gastric, and liver cancer [3,43–46]. Similarly, the results showed that gender played a role in the difference in GI cancer incidence, particularly for gastric and liver cancers, with males being 2.8, 2.4 and 1.7 times more likely to develop the cancers respectively. This finding is in line with current literature [45,47,48]. While we acknowledge the limited representation of Asians in the UK Biobank cohort, the main goal of this study was the identification of phenotypic features in relation to GI malignancies. Importantly, the relatively small number of Asians in the cohort should not undermine the robustness of the scientific inferences drawn regarding associations between exposures and health conditions.
In addition to sociodemographic characteristics, lifestyle factors, particularly smoking status, were found to be associated with certain GI cancers, including liver and oesophageal cancers. Smoking status remained a significant factor in the multivariable logistic regression analysis for liver cancer. Interestingly, exposure to smoking (including among those who had stopped smoking) consistently increased the risk of developing GI cancers. This is supported by other studies as well [45,47,49]. Variations in cancer incidence and mortality rates are influenced by several factors, including genetic, environmental, lifestyle, and socioeconomic variables [43–45,50].
Anthropometric measurement (body mass index classification) showed associations with oesophageal and gastric cancers. In line with the work of other researchers, we demonstrated a U-shaped relationship between BMI and the three cancers [51–53]. The abundant evidence on excess body weight accumulated over the past few decades has placed an emphasis on lipid metabolism and the mechanisms involved in malignancies [26,51–54]. As demonstrated in this study, apolipoprotein A1, apolipoprotein B and HDL cholesterol were associated with oesophageal, gastric and liver cancers. Studies have suggested that apolipoproteins play critical roles in malignancies including GI cancers. A low apolipoprotein A1 level is linked to a high cancer risk, systemic inflammatory response and poorer survival in some cancers, including oesophageal squamous cell carcinoma [55–58]. This is in accordance with our study findings.
Apolipoprotein A1 is a protein component of HDL cholesterol. Similar to apolipoprotein A1, HDL cholesterol is inversely associated with cancers, as demonstrated in the gastric cancer subset in this study. One of the proposed mechanisms for the opposing role of HDL cholesterol in tumorigenesis is its modulation of cell cycle entry and apoptosis through the mitogen-activated protein kinase (MAPK)-dependent pathway [59]. A Korean cross-sectional study also reported an association between reduced HDL/apolipoprotein A1 levels and an increased risk of colorectal cancer [60]. Emerging evidence suggests that the apolipoprotein A1/HDL axis, involved in lipid metabolism, is dysregulated in cancer. mRNA levels of apolipoprotein A1 were lower in hepatocellular carcinoma compared to normal liver tissue, the primary source of apolipoprotein A1, as determined by Oncomine database microarray data [61]. In hepatocellular carcinoma, the mechanisms underlying the transcriptional repression of apolipoprotein A1 remain obscure.
However, this result is consistent with previous reports of decreased apolipoprotein A1 protein levels in malignant liver tissue and hepatocellular carcinoma patient serum [62,63]. The decrease in apolipoprotein A1 transcription, intracellular and secreted apolipoprotein A1, and circulating HDL levels in hepatocellular carcinoma suggests that this pathway may have a tumour-suppressing function [61]. Several studies have discovered associations between serum apolipoprotein A1/HDL levels and various aspects of the natural progression of various cancer types [56,59,64,65]. Consistent with the study findings, a high apolipoprotein B level has been suggested as a risk factor for liver cancer; it is associated with poorer survival after surgery and a larger tumour size [66]. More in-depth exploration of the genetic information of apolipoproteins may help indicate liver malignancy and thus warrants further research. Mutations of apolipoprotein B are reported to account for almost 10% of all genetic mutations [66]. Specifically, a non-oncogenetic mutation of apolipoprotein B is observed, which can result in apolipoprotein B inactivation and is associated with the overexpression of oncogenic regulators and the downregulation of tumour suppressors, resulting in poorer survival outcomes. It is hypothesised that mutations that render apolipoprotein B inactive are preferred in tumorigenesis in order to provide more energy for cancer metabolism [55,65].
Multivariable logistic regression demonstrated that ionized serum calcium level was inversely associated with the risk of oesophageal cancer (adjusted OR = 0.37, 95% CI 0.18-0.74; p-value = 0.005). This is in line with studies that established the significance of calcium intake, in particular, as a potential effect modifier of the association between calcium and diseases including GI tract neoplasia [67–69]. Increasing dietary calcium intake was associated with a lower risk of oesophageal cancer [67–69]. There seem to be inconsistent findings on the relationship between serum calcium and risk of cancer in the current literature. The Swedish AMORIS study, which explored GI cancers, specifically oesophageal, stomach and colorectal cancers, showed a positive association between albumin-adjusted serum calcium and the risk of these GI cancers [70]. Nevertheless, a study exploring two large European prospective cohorts (including the UK Biobank) corroborated our study findings on ionized serum calcium level and risk of liver and colorectal cancer [71].
The difference in the direction of the association between the UK Biobank and EPIC cohorts on the one hand and the AMORIS study on the other was attributed to differences in study design and the degree of adjustment for confounding variables [71]. It is worthwhile to discuss this study's focus on serum calcium measurement rather than dietary calcium intake. Serum calcium indicates extracellular calcium homeostasis and is mainly regulated by vitamin D and parathyroid hormone. Consequently, abnormalities in serum calcium level may reflect an error in its regulation pathways instead of dietary calcium deficiency. This may result in distinct associations of dietary calcium and serum calcium with carcinogenesis [70–72]. Besides calcium, phosphate was also found to be inversely associated with liver cancer (adjusted OR = 0.36; 95% CI 0.22-0.58; p-value = 0.001). There is little research on phosphate and cancers, with inconsistent trends among the studies and/or cancers [54,73,74]. It is accepted that altered levels of phosphate have been linked to the onset of cancer, but the underlying pathophysiology remains uncertain. More in-depth studies are warranted to better understand the positive and inverse correlations observed between calcium and phosphate levels and the risk of cancers. This will shed light on the involvement of calcium and phosphate metabolism, and potentially related important hormonal factors, in cancer.
Additionally, hematological markers including monocytes and eosinophils were related to some GI cancers. Monocytes and eosinophils are types of white blood cells. Interestingly, research on the association of eosinophils and monocytes with GI cancer is scarce. Despite that, the value of immune-related markers in cancers is acknowledged. Previous studies focused mainly on pre-operative values of these circulating cells; however, changes in the immune profile may occur months or years prior to cancer diagnosis due to its role in the etiopathogenesis of tumours [75]. White blood cells were previously found to be associated with an increased risk of colorectal, lung and breast cancer [76]. Preclinical data showed that eosinophils have both pro-tumorigenic and anti-tumorigenic properties, via direct and indirect mechanisms. These varying outcomes in different studies imply that the role of eosinophils and their mediators may differ depending on the cancer type [77–79].
These findings provide valuable insights into the associations between various factors and GI cancers within the UK Biobank cohort. The identification of significant associations contributes to our understanding of the underlying mechanisms and risk factors involved in the development of GI cancers. The consistent association of cystatin C with different types of GI cancers suggests its potential as a promising biomarker for early detection and risk stratification. The findings from this study will guide our subsequent exploration of the whole exome sequencing data in GI cancers within the UK Biobank. This will promote a multi-omic methodology to help characterize GI cancers and associated phenomic features. Specifically, variants within the exome region of the genome, which is responsible for encoding proteins, can serve as valuable indicators for the identification of genetic variants that are highly relevant to drug discovery [80].
Notable strengths of this study include its prospective study design involving a large sample size, a lengthy follow-up period and evaluation of a comprehensive list of covariates. In addition, all biochemistry markers were measured using well-established and validated methods, ensuring accuracy and reliability throughout the study. This study is, however, not without its limitations. Despite the UK Biobank not being suitable for determining universally applicable rates of disease prevalence and incidence, its substantial size and diverse exposure measures allow for valid scientific inferences on associations between exposures and health conditions. Such assessments can be widely generalizable and do not require participants to be representative of the population at large [19,81,82]. In addition, this study focuses on the phenomic data involved in the pathogenesis of GI cancers, with the aim of identifying the potential phenomic feature(s) associated with the pathogenesis of GI cancers, and not of estimating incidence rates. Although the number of Asian participants is small, this is a cross-sectional analysis of UK Biobank data, which still represents the largest database at present, and the present findings are in accordance with previous studies looking into different health outcomes and their associations with race. In addition, the study relied on self-reported lifestyle data, which introduces the possibility of recall bias. To validate and expand upon these findings, additional research with diverse populations and rigorous data acquisition techniques is required. Besides, in terms of study data, no information was available regarding potential confounding variables such as vitamin D and/or calcium supplementation. Furthermore, we were unable to explain the effect of dietary calcium on gastrointestinal carcinogenesis, as suggested by biological studies (Supplementary Table 1).
In conclusion, this study identified several significant associations between various factors and GI cancers using the UK Biobank cohort. One marker, cystatin C, emerged as a consistent biomarker associated with different types of GI cancers. Given the small proportion of Asians within the UK Biobank, the association between race and GI cancers requires further confirmation. The findings provide valuable insights into potential diagnostic and therapeutic targets for GI cancers, emphasizing the importance of personalized approaches in cancer prevention, early detection, and treatment strategies. In order to provide a more in-depth understanding of how these factors are associated with GI cancers and shed light on the molecular pathogenesis of GI cancers, future research should employ a multi-modal approach exploring the genomics and proteomics of the UK Biobank cohort. This will allow validation of the study findings and enhance understanding of the underlying mechanisms linking these factors to GI cancer development.

Table 1. Baseline characteristics of the study population in the UK Biobank. Mean (standard deviation) is presented for continuous variables.
a Mean (SD) values and n (percentages) are reported for continuous and categorical variables, respectively.
b Other GI cancer includes small intestine cancer, cancer of anus and anal canal, gallbladder cancer, cancer of other and unspecified parts of biliary tract, and cancer of other digestive organs.