The shared genetic architecture between epidemiological and behavioral traits with lung cancer

The complex polygenic nature of lung cancer is not fully characterized. Our study seeks to identify novel phenotypes associated with lung cancer using cross-trait linkage disequilibrium score regression (LDSR). We measured SNP heritability (h2) for 347 traits and their pairwise genetic correlation (rg) with lung cancer risk using genome-wide association study summary statistics from the UKBB and OncoArray consortium. Further, we repeated the analysis after removing genomic regions previously associated with smoking behaviors to mitigate potential confounding effects. We found significant negative genetic correlations between lung cancer risk and dietary behaviors, fitness metrics, educational attainment, and other psychosocial traits. Alcohol taken with meals (rg = −0.41, h2 = 0.10, p = 1.33 × 10^-16), increased fluid intelligence scores (rg = −0.25, h2 = 0.22, p = 4.54 × 10^-8), and the age at which full-time education was completed (rg = −0.45, h2 = 0.11, p = 1.24 × 10^-20) demonstrated negative genetic correlation with lung cancer susceptibility. Body mass index was positively correlated with lung cancer risk (rg = 0.20, h2 = 0.25, p = 2.61 × 10^-9). This analysis reveals shared genetic architecture between several traits and lung cancer predisposition. Future work should test for causal relationships and investigate common underlying genetic mechanisms across these genetically correlated traits.

www.nature.com/scientificreports/
unknown population stratification or cryptic relatedness exists in the underlying data 24. Prior GWAS investigations in lung cancer have revealed unique loci with strong statistical significance, yet these regional associations vary across histological subtypes of lung cancer 2. Beyond this heterogeneity between histological subtypes, known lung cancer risk loci account for only a minor proportion of the total estimated heritability of lung cancer, indicating that a substantial proportion of its heritable causes 25 remains unidentified. A more comprehensive approach to understanding tumorigenic mechanisms may be fruitful. Focused work on the genetic architecture behind disease co-development may be more informative than studying individual phenotypes 26,27. A knowledge gap remains in quantifying the extent to which other diseases, environmental exposures, and phenotypic traits correlate with a predisposition to lung cancer. A novel statistical regression framework, known as cross-trait linkage disequilibrium score regression (LDSR), may be employed to fill this gap. LDSR uses GWAS summary statistics to identify genome-wide genetic correlations between phenotypes of interest 28. The similarity of the SNP effect estimates reported in GWAS summary statistics is compared between traits. LDSR allows the genetic correlation (rg) between phenotypes to be estimated while minimizing effects from selection biases in the recruitment of comparable controls from the same source population 24. Use of this method can identify correlations in the genetic architecture between traits, allowing etiological insights to be gleaned.
Here, we quantify the association between genetically influenced epidemiological and behavioral traits and the risk of lung cancer. We use summary statistics generated by prior lung cancer GWAS and apply LDSR to estimate cross-trait genetic correlations with lung cancer. We additionally evaluate how these traits correlate with each of the major histological subtypes of lung cancer (adenocarcinoma, squamous cell carcinoma, and small cell carcinoma), and further evaluate associations in ever- and never-smokers. We aim to confirm prior associations with lung cancer and to identify novel phenotypic associations from GWAS datasets.

Methods
Summary statistics for lung cancer. This work is a continuation of efforts conducted by the Transdisciplinary Research of Cancer in Lung of the International Lung Cancer Consortium (TRICL-ILCCO) 29 and the OncoArray Consortium 30. The TRICL-OncoArray Consortium has previously published GWAS summary statistics from a meta-analysis of lung cancer GWAS. The complete methods have been published previously 29,30 and are presented here in brief. Lung cancer patients and healthy controls with no personal history of lung cancer were recruited after individual institutional IRB approval and informed consent for genotyping. Genotyping was performed using the Illumina OncoArray-500K BeadChip of 533,631 SNPs. Standard quality control measures were implemented to exclude underperforming samples and SNPs 29. Individuals and SNPs with genotyping call rates < 95% were removed. Genotype imputation was conducted using the 1000 Genomes Project Phase 3 (October 2014) reference dataset. For positions with > 2 alleles, the more common variant was included during imputation. After imputation and quality control, 502,933 SNPs from 29,266 lung cancer patients and 56,450 healthy controls of European ancestry were incorporated into a meta-analysis 29. Among the lung cancer cases, the histological subtypes comprised 11,273 cases of adenocarcinoma, 2,664 cases of small cell carcinoma, and 7,426 cases of squamous cell carcinoma (Supplementary Table 1). We obtained the summary statistics from the TRICL-ILCCO GWAS meta-analysis 29 for overall lung cancer, the histological subtypes of lung cancer, and the 'ever' and 'never' smoking subcohorts.
Phenotype and exposure accession with United Kingdom Biobank genome-wide association studies. GWAS summary statistics for cross-trait LDSR analyses were obtained from the United Kingdom Biobank (UKBB). The UKBB is a national and international health repository 31. Since its inception in 2006, the UKBB has collected clinical and genotypic data for 500,000 adult participants across 22 sites in the United Kingdom 31. Participants in this longitudinal project were aged 40-69 at enrollment. Initial information was gathered by clinical exam, questionnaire, and biospecimen sampling. Participants will be followed for 30+ years. Follow-up health data are obtained periodically through a unique encrypted identifier linked to electronic health records from the UK National Health Service (NHS). Each of the > 500,000 participants in the UKBB has been genotyped, 90% of whom were genotyped using a custom Affymetrix UKBB Axiom array. This array assayed ~850,000 variants across the genome, which were used to impute 9.1 million SNPs with satisfactory quality control measures in place. These imputation procedures are conducted by the Wellcome Trust Center for Human Genetics and are completed internally at the UKBB before the data release. GWAS were conducted on these imputed data, and summary-level statistics were made publicly available (https://nealelab.github.io/UKBB_ldsc/downloads.html#reference_files). We obtained all of our GWAS summary statistics from the second batch of UKBB GWAS results published online and updated in August 2018.
Harmonization and quality control with SNP filtering. We harmonized the publicly available GWAS summary statistics that we obtained. Our final dataset included summary statistics for selected epidemiological and individual lifestyle traits, including alcohol use and fitness activity levels and routines. It also included biometric measurements, such as BMI and body fat percentage. Reported educational attainment, employment status, workplace environment, and psychological experiences were also included. The UKBB summary statistics contained SNP-level effect sizes (beta) for each trait, with Z-scores calculated by dividing each SNP effect size by its standard error. To harmonize these datasets, and as an additional quality control measure, we filtered the imputed SNPs from the UKBB to include only autosomal SNPs with a minor allele frequency greater than 0.01 and an imputation quality (INFO) score greater than 0.90. We further removed SNPs from our harmonized dataset that were not in HapMap3 or that had a minor allele frequency less than 5% in European populations, in line with previously published methods 24,32.
Estimating pairwise genetic correlations and heritability. With this information, we estimated genome-wide SNP heritability using LDSR. Additionally, we used LDSR to compute the pairwise genetic correlation between each of the UKBB traits and lung cancer risk from the TRICL-OncoArray consortium. LDSR calculates genetic correlation by regressing the product of SNP Z-scores (Z_UKBB × Z_TRICL-OncoArray) against the SNP's linkage disequilibrium score 24. The slope of this regression provides an estimate of the genetic covariance between two traits. Genetic covariance is converted to a genetic correlation by normalizing it by the heritability of each of the two compared traits.
The heritability of a trait can be thought of as the genetic covariance of a trait with itself and ultimately represents the proportion of a trait that genetic effects can explain 28 . LDSR mitigates potential biases from population stratification 19 and cryptic relatedness 24 by modeling an intercept term that accounts for any genomic inflation. We applied a cross-trait LDSR model that included an intercept in these analyses to account for hidden biases that may exist between reference and target populations, especially those that may arise due to the instability of linkage disequilibrium scores in European populations and sub-populations 24,33 .
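The regression described above can be illustrated with a minimal sketch. It is not the implementation used in this study: it uses an unweighted least-squares fit, whereas published LDSR software applies regression weights and estimates standard errors by block jackknife. All function names, sample sizes, LD scores, and parameter values below are hypothetical, and the inputs are noise-free expected values chosen so that the arithmetic can be followed.

```python
import math

def z_score(beta, se):
    """Z-score of a SNP effect: effect size divided by its standard error."""
    return beta / se

def ols(x, y):
    """Unweighted least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def snp_heritability(chi2, ld_scores, n, m):
    """Regress per-SNP chi-square (z**2) on LD score; h2 = slope * M / N."""
    slope, intercept = ols(ld_scores, chi2)
    return slope * m / n, intercept

def genetic_covariance(z_products, ld_scores, n1, n2, m):
    """Regress per-SNP z1*z2 on LD score; rho_g = slope * M / sqrt(N1*N2)."""
    slope, intercept = ols(ld_scores, z_products)
    return slope * m / math.sqrt(n1 * n2), intercept

def genetic_correlation(rho_g, h2_1, h2_2):
    """Normalize genetic covariance by the two SNP heritabilities."""
    return rho_g / math.sqrt(h2_1 * h2_2)

# Toy, noise-free inputs (every value here is an arbitrary illustration).
M = 1_000_000              # assumed number of SNPs in the LD reference
N1, N2 = 50_000, 300_000   # assumed sample sizes of the two GWAS
H2_1, H2_2, RG = 0.10, 0.25, 0.20
RHO = RG * math.sqrt(H2_1 * H2_2)
ld = [10.0, 50.0, 100.0, 200.0, 400.0]

# Expected statistics under the LD score regression model.
chi2_1 = [1.0 + N1 * H2_1 * l / M for l in ld]
chi2_2 = [1.0 + N2 * H2_2 * l / M for l in ld]
zprod = [math.sqrt(N1 * N2) * RHO * l / M for l in ld]

h2_1_est, _ = snp_heritability(chi2_1, ld, N1, M)
h2_2_est, _ = snp_heritability(chi2_2, ld, N2, M)
rho_est, _ = genetic_covariance(zprod, ld, N1, N2, M)
rg_est = genetic_correlation(rho_est, h2_1_est, h2_2_est)
print(h2_1_est, h2_2_est, rg_est)
```

With real summary statistics, the per-SNP chi-square values and Z-score products would be computed from the observed Z-scores, and the fitted intercepts would absorb inflation from population stratification or sample overlap.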
We used LDSR to calculate the genetic correlations between lung cancer risk and traits of interest. We additionally performed LDSR for each of the histological subtypes of lung cancer, including small cell carcinoma, squamous cell carcinoma, and adenocarcinoma. Further, we performed LDSR between traits of interest and lung cancer risk in ever- and never-smoker subgroups. Individuals who reported having smoked fewer than 100 cigarettes throughout their lives were defined as "never smokers," and those who had smoked more than 100 cigarettes as "ever smokers" 29. We stratified both lung cancer cases and controls by smoking status for these analyses.

Removal of known regions related to smoking behaviors. A genetic correlation between a trait and lung cancer in LDSR analyses does not necessarily imply a causal relationship. Indeed, both the trait and lung cancer risk may be influenced by a third, unmodeled trait that independently affects each. Notably, smoking status has the potential to confound our associations (e.g., the genetic correlation between lung cancer risk and emphysema risk would likely be attributable to the effect of smoking on both diseases) 34-38. In addition to stratifying our LDSR analyses by 'never' and 'ever' smoking status as available from the TRICL-OncoArray Consortium, we also excluded genomic loci previously associated with smoking behaviors. A recent meta-analysis quantified the effect of SNPs on several smoking behaviors, including "age of initiation of smoking", "cigarettes per day", "smoking cessation", and "smoking initiation" 39. These authors used a conditional analysis method 40 to identify SNPs independently associated with at least one of these smoking-related traits. At a predetermined genome-wide significance threshold of p < 5 × 10^-8, 467 SNPs were found to be associated with smoking-related traits 39. We repeated our LDSR analyses after removing the regions around each of these 467 smoking-related SNPs from our summary statistics. Specifically, we identified each sentinel variant from the meta-analysis and removed all SNPs within ± 500 kb of it. SNPs that were filtered at this step appear in Supplementary Table 2, which also annotates the upper and lower bounds of the genomic regions removed. Changes in the number of SNPs included and excluded from this analysis, per histological subgroup and lifetime smoking status, appear in Supplementary Table 3. Quantile-quantile plots of the p-values observed from the TRICL-OncoArray meta-analysis before and after removing smoking-related SNPs are shown in Supplementary Figure 1.
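The regional exclusion described above can be sketched as follows. This is an illustrative reconstruction rather than the code used in the study: the function names, record layout, identifiers, and positions are hypothetical toy values, and the ± 500 kb boundary is assumed here to be inclusive.

```python
from bisect import bisect_left, bisect_right

WINDOW_BP = 500_000  # +/- 500 kb around each sentinel variant

def build_index(sentinels):
    """Group sentinel positions by chromosome and sort them for searching."""
    index = {}
    for chrom, pos in sentinels:
        index.setdefault(chrom, []).append(pos)
    for positions in index.values():
        positions.sort()
    return index

def near_sentinel(chrom, pos, index, window=WINDOW_BP):
    """True if any sentinel on this chromosome lies within the window."""
    positions = index.get(chrom, [])
    lo = bisect_left(positions, pos - window)
    hi = bisect_right(positions, pos + window)
    return hi > lo

def remove_smoking_regions(snps, sentinels, window=WINDOW_BP):
    """Keep only SNPs farther than `window` from every sentinel variant."""
    index = build_index(sentinels)
    return [s for s in snps
            if not near_sentinel(s["chrom"], s["pos"], index, window)]

# Toy data: one hypothetical sentinel and three hypothetical SNPs.
sentinels = [("15", 78_000_000)]
snps = [
    {"rsid": "rs_a", "chrom": "15", "pos": 78_400_000},  # 400 kb away
    {"rsid": "rs_b", "chrom": "15", "pos": 78_600_000},  # 600 kb away
    {"rsid": "rs_c", "chrom": "1", "pos": 78_000_000},   # other chromosome
]
kept = remove_smoking_regions(snps, sentinels)
print([s["rsid"] for s in kept])
```

Sorting the sentinel positions once per chromosome keeps each lookup logarithmic, which matters when millions of SNPs are screened against several hundred sentinels.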
These methods are summarized graphically in Fig. 1. Executing them entails multiple comparisons. We tested 347 traits for genetic correlation with overall lung cancer, adenocarcinoma, squamous cell carcinoma, and small cell carcinoma. Additionally, we tested these traits for associations in 'never' and 'ever' smoking populations. These 2082 independent tests were conducted twice, before and after the removal of smoking-related SNPs, for a total of 4164 comparisons. Using a stringent Bonferroni correction, we set our adjusted P value significance threshold to 1.2 × 10^-5, or −log10(P) > 4.92. Here we report the trait associations with P values below the Bonferroni-adjusted threshold. In Supplementary Table 5, we present the heritability, genetic correlations, and significance values for each comparison conducted. In this table, we further provide LDSR confidence thresholds and heritability thresholds for each UKBB trait. Finally, we offer a direct uniform resource locator link for each UKBB trait, allowing for ease of inquiry into trait type counts, inclusion criteria, distribution histograms, and other relevant metrics.
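The multiple-testing arithmetic above can be reproduced in a few lines. The family-wise error rate of 0.05 is an assumption on our part; it is not stated explicitly in the text but is consistent with the reported threshold.

```python
import math

n_traits = 347
n_outcomes = 6   # overall, three histological subtypes, 'ever' and 'never' smokers
n_passes = 2     # before and after removing smoking-related regions
alpha = 0.05     # assumed family-wise error rate

tests_per_pass = n_traits * n_outcomes
total_tests = tests_per_pass * n_passes
threshold = alpha / total_tests
neg_log10 = -math.log10(threshold)

print(tests_per_pass, total_tests)            # 2082 4164
print(f"{threshold:.2e}", round(neg_log10, 2))  # 1.20e-05 4.92
```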
In contrast, obtaining none of the previously mentioned academic qualifications demonstrated a positive genetic correlation with lung cancer susceptibility, which was strongest in overall lung cancer (rg = 0.38, p = 5.91 × 10^-12; rg† = 0.38, p† = 3.78 × 10^-16), and the trend held across histological subtypes and in 'ever' smokers. Fluid intelligence scores were genetically correlated with decreased lung cancer susceptibility across all histological and smoking status sub-classifications (overall rg = −0.25, p = 4.54 × 10^-8) but did not reach statistical significance in 'never' smokers. The calculated Townsend deprivation index 41, a metric combining the census demographics of car ownership, household overcrowding, household employment status, and home ownership, demonstrated a significant positive genetic correlation with lung cancer (overall lung cancer rg = 0.35, p = 1.03 × 10^-10; rg† = 0.28, p† = 9.61 × 10^-6). A summary of the significant education- and employment-related associations is presented in Fig. 3.

Heritability and genetic correlations between lung cancer and fitness metrics. Measured and reported fitness metrics were genetically correlated with lung cancer susceptibility. Increased body fat percentage, impedance of the whole body, waist circumference, and increased body mass index (BMI) correlated positively with lung cancer susceptibility. For BMI, positive genetic correlations were observed for overall lung cancer (rg = 0.20, p = 2.61 × 10^-9; rg† = 0.19, p† = 3.23 × 10^-8) as well as for small cell lung carcinoma (rg = 0.24, p = 3.54 × 10^-7; rg† = 0.24, p† = 5.27 × 10^-5) and squamous cell carcinoma (rg = 0.27, p = 9.91 × 10^-10; rg† = 0.26, p† = 1.01 × 10^-6). Similarly, positive genetic correlations were observed between body fat percentage and overall lung cancer (rg = 0.17, p = 6.11 × 10^-7; rg† = 0.17, p† = 1.23 × 10^-6) and squamous cell carcinoma (rg = 0.23, p = 1.85 × 10^-7; rg† = 0.23, p† = 9.81 × 10^-6). Participant-reported activity level traits demonstrated negative genetic correlation with lung cancer susceptibility. These physical activity traits include DIY physical activity in the last 4 weeks, exercise such as swimming or cycling in the last 4 weeks, and cycling or walking as methods of transport to work. Conversely, having 'no physical activity in the last 4 weeks' demonstrated a positive genetic correlation with lung cancer susceptibility. We highlight "swimming, cycling, and keeping fit in the last 4 weeks", which demonstrated significant negative genetic correlations with lung cancer susceptibility.
Similarly, the age at last live birth also demonstrated a significant negative genetic correlation with lung cancer susceptibility overall and in the small cell, squamous cell, and ever-smoking cohorts. The trait 'age started oral contraceptive' showed a significant negative genetic correlation with overall lung cancer (rg = −0.28, p = 1.30 × 10^-5; rg† = −0.27, p† = 5.93 × 10^-5). These findings are further detailed in Fig. 5.
A full correlation plot of all highly correlated traits is presented as Fig. 6, which includes all UKBB traits with significant genetic correlation with lung cancer after Bonferroni correction. Figures 7 and 8 present all nominally associated UKBB traits (p < 0.05), including their rg and standard errors, in cohort-clustered forest plots.

Discussion
We sought to determine the shared genetic architecture between environmental and behavioral factors and lung cancer predisposition. LDSR has previously demonstrated efficacy and accuracy in determining the shared heritability and genetic correlation between phenotypes and disease states of interest 42,43. To date, the TRICL-OncoArray Lung consortium comprises the largest lung cancer GWAS conducted in European-ancestry populations 30. We leveraged these lung cancer GWAS meta-analysis data together with GWAS summary statistics of traits from the UKBB to comprehensively assess shared genetic architectures between specific traits and lung cancer risk, observing numerous significant associations that were consistent across strata of lung cancer histology. We observed significant positive and negative (i.e., protective) genetic correlations between lung cancer risk and individual behavioral characteristics and other environmental factors. We acknowledge that the strength of the LDSR method relies on the assumption that the genetic architectures of the populations are similar. To help ensure this, our analyses were conducted on European-ancestry populations in all studies, and the SNPs included were imputed using standard methods developed for application to the 1000 Genomes Project.
We provide further evidence that lung cancer is a heritable disease. Overall, our analysis estimated the heritability of lung cancer to be 8.3 ± 1.3%, with comparable heritability in adenocarcinoma (6.8 ± 1.0%), higher heritability in small cell lung carcinoma (10.5 ± 1.9%), and lower heritability in squamous cell carcinoma of the lung (5.2 ± 1.1%). These findings are similar to previous reports 29. The heritability of lung cancer among never smokers was considerably lower than among smokers, which might indicate that heterogeneity in the etiology of lung cancer in never smokers obscures its heritable nature. It is noteworthy that we found no significant associations in LDSR analyses within the 'never' smoker subgroup, but the observed genetic correlations in this cohort consistently mirrored the direction observed in 'ever' smokers and across histological subgroups. The never-smoker subgroup was a considerably smaller sample (2355 lung cancer cases, 7504 non-cancer controls) and had the lowest heritability of any of our lung cancer sub-strata, indicating that we may have been underpowered to detect cross-trait associations within this group.
The frequency and circumstance of alcohol consumption demonstrated significant but mixed correlations with the genetic architecture of lung cancer. We found that "alcohol taken with meals" was negatively correlated with overall lung cancer. However, when analyzing this trend by type of alcohol consumed, higher average weekly beer and cider intake and higher weekly spirits intake were positively genetically correlated with lung cancer risk. In contrast, higher average weekly champagne, white wine, or red wine intake had a negative correlation. This pattern has previously been observed in a non-genetic epidemiological meta-analysis 44, and, notably, we observe concordant findings through LDSR. One possible explanation is that concurrent smoking is more likely in those who drink beer or spirits and less likely in wine drinkers, possibly due to socioeconomic differences 45. Evidence against this hypothesis is that the genetic correlations between lung cancer and alcohol intake were consistent across histological subtypes and between 'never' and 'ever' smokers, although non-significant in 'never' smokers.
Educational attainment traits demonstrated a consistent genetic correlation with lung cancer risk in LDSR analyses. Certifications of educational attainment were consistently negatively correlated with lung cancer susceptibility. The converse is also true, with 'no educational qualifications' (i.e., no college or university degree), no professional qualifications in nursing or teaching, no "A" levels, and no general certificate of secondary education demonstrating a positive correlation with lung cancer risk. These findings retained significance across histological subtypes. Removal of smoking-related SNPs as a method to mitigate residual confounding effects did not change the identified correlations or the significance of these findings. Complementing these findings, fluid intelligence score, which had a consistent h2 ~ 0.22 ± 0.01, demonstrated a consistent negative genetic correlation with lung cancer across histological subtypes and smoking statuses.
Summary statistics for several quantitative as well as binary fitness-related traits demonstrated consistent associations. However, consistency in statistical significance was not achieved among each of the three histological subgroups. Indicators of BMI demonstrated relatively consistent findings. We highlight BMI and body fat percentage. These traits demonstrate significant heritability (h2 ~ 0.22 ± 0.01) and have consistent positive genetic correlations with lung cancer. A general trend of negative genetic correlation between increased physical activity and lung cancer risk was observed; however, these traits had marginal estimated heritability of approximately 0.03. BMI's causal role in lung cancer oncogenesis was recently validated using Mendelian randomization 6; however, the strength of association measured in this prior study varied by lung cancer histology.
Several specific traits stood out from these analyses. A modest correlation was observed for depression and depression-related psychosocial traits, including 'frequency of fed-up feelings,' 'frequency of unenthusiasm/disinterest,' 'loneliness, isolation,' and 'mood swings.' These symptoms are part of the diagnostic criteria for mental illnesses, and it is worth noting that the incidence of smoking is higher in populations who suffer from mental illness than in those who do not 46. Other standout traits genetically correlated with lung cancer risk included the participant-reported status of being breastfed as a baby. The heritability of this trait was 0.023 ± 0.002; however, a consistent negative genetic correlation with lung cancer was observed. While interesting, these findings were significant after correction for multiple comparisons only for the overall and squamous cell lung cancer histological subgroups. The ages at which a woman has her first and last live births and the age at which she started oral contraceptives were other traits that demonstrated a genetic correlation with lung cancer risk. These traits each revealed appreciable heritability and consistent, highly significant negative genetic correlations. It is well known that the ages at first live birth 47, last live birth 48, and initiation of oral contraceptive pills 49 are associated with androgen modulation and modified cancer risk. It is plausible that these traits reflect such effects in lung cancer predisposition 50,51. We note that these results should be viewed as revealing only genetic associations, not as estimates of causal effects. Our use of LDSR with an intercept allowed for acceptable mitigation of population stratification and cryptic relatedness confounders that could exist between the UKBB population and our TRICL-OncoArray lung cancer dataset.
Further, we used individuals of European descent in these cohorts to mitigate this risk. Additional confounding, predominantly through smoking, has the potential to limit the strength of these analyses. To assess any hidden effects of smoking, we stratified our analysis by whether individuals had smoked roughly 100 cigarettes in their lifetime. In addition to this 'never' versus 'ever' smoking comparison, we re-ran LDSR analyses after excluding genomic regions previously associated with smoking-related behaviors. Although GWAS meta-analyses of smoking behaviors have included upwards of 500,000 individuals, it is likely that additional genetic loci of small effect influence smoking behaviors and remain undetected by GWAS. Therefore, our analyses excluding known smoking-associated regions may not fully account for the contribution of smoking-associated genomic variation to the traits in our LDSR analyses. We present all our results, including these smoking sub-analyses, in the supplemental material.
Using cross-trait LDSR, we have identified traits that are positively and negatively genetically correlated with lung cancer. These findings indicate that lung cancer development shares genetic background with alcohol use, educational attainment, fitness, and several other specific traits. Our work should be viewed as a considerable step towards understanding the shared genetic architecture between these traits and lung cancer. A potential next step is to perform causal analyses on the strongly correlated traits we have described. Mendelian randomization studies may help distinguish causal relationships from mere association between these traits and the development of lung cancer. Ultimately, identifying causal relationships may help to explain the shared genetic architecture of these traits with lung cancer and to create accurate predictive risk models for lung cancer development. While causal modeling has an important role, it requires identifying and specifying sets of markers that can reliably represent intermediate traits. The LD score regression approach evaluates the entire genome and so should be a more powerful filter for future causal modeling, once adequate genetic predictors for each of the traits identified in our analysis are available.