mRNA splicing is an important mechanism to regulate mRNA expression. Abnormal regulation of this process may lead to lung cancer. Here, we investigated the associations of 11,966 single-nucleotide polymorphisms (SNPs) in 206 mRNA splicing-related genes with lung cancer risk by using the summary data from six published genome-wide association studies (GWASs) of Transdisciplinary Research in Cancer of the Lung (TRICL) (12,160 cases and 16,838 controls) and another two lung cancer GWASs of Harvard University (984 cases and 970 controls) and deCODE (1,319 cases and 26,380 controls). We found that a total of 12 significant SNPs with false discovery rate (FDR) ≤0.05 were mapped to one novel gene PRPF6 and two previously reported genes (DHX16 and LSM2) that were also confirmed in this study. The six novel SNPs in PRPF6 were in high linkage disequilibrium and associated with PRPF6 mRNA expression in lymphoblastoid cells from 373 Europeans in the 1000 Genomes Project. Taken together, our studies shed new light on the role of mRNA splicing genes in the development of lung cancer.
Lung cancer is a major challenge to human health and caused by multiple environment and genetic factors1. Smoking, radon gas, asbestos and air pollution have been established as the environment risk factors2, and genetic factors have also been clearly illustrated in familial3 and segregation studies4. Single nucleotide polymorphisms (SNPs) are the most common genetic variants in the human genome and have been shown to be associated with risk of human diseases, including cancers5. With a large sample size of study subjects, multiple-stage replication and high-throughput chips, genome-wide association studies (GWASs) have been shown to be a robust way of detecting genetic variants involved in susceptibility to cancer. To date, GWASs in different ethnic groups have successfully identified 30 loci in 12 chromosomal regions (3q28, 5p15.33, 6p21.1, 6p21.33, 6q22.1, 7p15.3, 10q25.2, 13q12.12,13q13.1, 15q25.1, 17q24.2, 22q12.2) that confer susceptibility to lung cancer6,7,8,9,10,11,12,13,14,15,16,17, among which, loci at 5p15.33, 6p21.33 and 15q25.1 were found to be risk factors for lung cancer in European descents.
Despite the great success achieved by GWASs, the identified SNPs by GWASs still only explain a small fraction of heritability of lung cancer, a phenomenon called “missing heritability”18. Therefore, many complementary approaches have been developed to improve the study power of GWASs. For example, one of such approaches is the pathway-based association analysis through additional imputation to increase the number of SNPs to be analyzed, which could detect the risk-associated genes with multiple, independent and small effect sizes. This approach could alleviate issues due to insufficient chip coverage of earlier GWASs, reduce the multiple-testing burden of GWAS, achieve consistent results across studies and provide understanding of genetic findings with some functional relevance19,20,21,22.
Among the biological candidate pathways for lung carcinogenesis, mRNA splicing, a modification of the pre-mRNA transcript in which introns are removed and exons are joined, is of significant importance23. In most of the circumstances, this pathway functions normally for the human genome to generate proteomic diversity sufficient for various biological processes. However, cancer cells may take advantage of this mechanism to produce aberrant proteins with added, deleted, or altered functional domains that contribute to carcinogenesis including lung cancer24. Recent studies had shown that somatic mutations and germline variants of some mRNA splicing-related genes were associated with development of lung cancer. For instance, Shen et al. found several SNPs in mRNA splicing associated genes (SRSF7, PTBP2 and HNRNPQ) were associated with lung cancer risk in a Chinese population25. To date, however, only a limited number of candidate genes and SNPs in the pathway have ever been studied and reported. In the present study, we systematically selected 206 mRNA splicing-related genes and comprehensively investigated associations between 11,966 genetic variants of these genes and lung cancer risk using eight published lung-cancer GWAS datasets from the Transdisciplinary Research in Cancer of the Lung (TRICL) Consortium.
Overall associations between SNPs in mRNA splicing-related genes and lung cancer risk in the TRICL Consortium
Overall, 11,966 SNPs in 206 mRNA splicing-related genes in six GWAS datasets were selected from the TRICL Consortium (an imputated dataset that included 5,472,374 common SNPs), and their associations with lung cancer risk are shown in the Manhattan plot (Fig. 1). The sample size for each study had been listed in Table 1. After FDR correction (PFDR corrected ≤ 0.05), the 12 SNPs in three genes remained statistically significant (Fig. 1). The most significant SNP for each gene was: rs75100087 in PRPF6 (P = 6.83E-06); rs115420460 in DHX16 (P = 3.93E-09) and rs114312980 in LSM2 (P = 2.19E-07). Their basic information and associations with lung cancer risk are listed in Table 2. SNPs in DHX16 (rs115420460) and LSM2 (rs114312980, rs115489726, rs115801685, rs114637560, rs115834633) were excluded from further analysis due to their locations within the previously identified lung cancer susceptible region Chr6p21.33 and in high LD with previously reported lung cancer GWAS SNPs. Specifically, as shown in Supplement Figure 2, rs115420460 in DHX16 was in high LD (r2 > 0.6) with the previously reported lung cancer GWAS SNP rs43247989. Similarly, Supplement Figure 3 shows that all of the five SNPs in LSM2 were in high LD (r2 > 0.8) with SNP rs3117582 that was identified from a the previously published lung cancer GWAS14. Therefore, we focused on the remaining six newly identified SNPs in PRPF6 (Table 2) for further analysis. Regional plot and recombination rates of PRPF6 (rs8126213 ± 500 kb) in the TRICL consortium are presented in Fig. 2.
LD analysis and SNP function annotation
The diagram of six PRPF6 SNPs and their LD plot are shown in Supplement Figure 4b and 4c, respectively. High LD was observed between each pair of the six SNPs (r2 > 0.80), indicating that any of these six SNPs can be a good tag for others. Next, we exploited dbSNP function annotation (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?showRare=on&chooseRs=all&go=Go&locusId=24148) and another two online SNP function prediction tools: SNPfunc (http://snpinfo.niehs.nih.gov/snpinfo/snpfunc.htm) and regulomeDB (http://regulomedb.org/index) to assess their functionality (Supplementary Table 2).
Expression quantitative Trait loci (eQTL) analysis
Expression quantitative trait loci (eQTL) analysis, which directly investigates the correlations between genetic variants and gene expression, has been gradually proved as an effect method to characterize the function of SNPs. In the present study, the eQTL analysis was performed by using the data available from lymphoblastoid cell lines derived from 373 individuals of European descendent (http://www.1000genomes.org/). Finally, all of these six SNPs in PRPF6 were found to be significantly associated with expression levels of PRPF6 in both additive and dominant models (Supplementary Figure 5 and Supplementary Table 2). We also queried the GTEx database (http://www.gtexportal.org/) and found that SNP rs8126213 was significantly correlated with mRNA expression levels of PRPF6 in normal lung tissues (P = 1.50 × 10−5), which was consistent with the results in the lymphoblastoid cell lines. Similar results were found for other five SNPs (Supplementary Figure 6).
Replication of PRPF6 SNP (rs8126213) in another two lung-cancer GWASs
We then selected one of these six PRPF6 SNPs (rs8126213) and investigated its association with lung cancer risk in two additional lung-cancer GWASs of Caucasian origin, Harvard Lung Cancer Study (984 cases and 970 controls) and Icelandic Lung Cancer Study (deCODE from the ILCCO) (1,319 cases and 26,380 controls). However, no significant association was observed in these two studies (P = 0.801 and 0.850, respectively), which may potentially result from a relatively limited sample size of the cases, although their effects were in the same direction as in the TRICL‐ILCCO studies, and the overall effect among all these eight GWASs remained significant (OR = 0.91, 95% CI = 0.87–0.95, P = 6.36E-05 in an additive genetic model) (Table 3). Similar effects were also observed on adenocarcinoma (OR = 0.91, 95% CI = 0.87–0.95, P = 0.002) and squamous cell carcinoma (OR = 0.90, 95% CI = 0.83–0.97, P = 0.005) in the subgroup analysis.
mRNA splicing is an important biological mechanism to regulate mRNA expression. Somatic mutations and germline variants of the mRNA splicing-related genes have been found to be associated with various kinds of cancers. For example, SF3B1, U2AF1 and SRSF2 mutations were observed in myeloid and lymphoid lineage tumors26. SF3B1 mutations were commonly found in uveal melanomas27. Mutations affecting spliceosome genes that resulted in defective splicing were attributed to leukemia28. For lung cancer, it had been reported that U2AF1 mutation was found in 3% of lung adenocarcinoma cases29. However, few studies reported a role of genetic variants in the etiology of cancer. For example, only one recent study found that genetic variants in three mRNA splicing-associated genes might modify individual susceptibility to lung cancer in a Chinese population25.
In the present study, we systematically evaluated the associations between 11,966 genetic variants in 206 mRNA splicing-related genes and lung cancer risk using the data from eight published GWAS datasets. To the best of our knowledge, this is the first and largest study focusing on exploring associations between SNPs from mRNA splicing-related genes and risk of lung cancer. Overall, our study identified six SNPs within mRNA splicing-related gene PRPF6 that might play an important role in the development of lung cancer. Furthermore, all of these six SNPs were found to be associated with PRPF6 mRNA expression in both lymphoblastoid cell lines and normal lung tissues of European descendent. Therefore, we proposed that these six SNPs were associated with abnormal expression of PRPF6 and thus modified individual’s susceptibility to lung cancer. However, two major issues of the present study should be addressed. Firstly, the associations of these six SNPs with risk of lung cancer were observed in the six TRICL GWASs but not in Harvard and deCODE GWASs, probably due to a relatively limited sample size of the cases in the replication. Further replication studies are warranted to confirm our results. Secondly, the lung cancer-associated mRNA splicing-related genes identified in the previous Chinese study25 were not replicated in our study, which could be attributed to different genetic background or different environmental exposures as well as other unknown contributing factors between two populations.
PRPF6 is located on chromosome 20q13.33 and is an essential member of the small nuclear ribonucleic proteins (snRNPs), playing an important role in mRNA splicing. For instance, a missense mutation in PRPF6 impairs mRNA splicing and contributes to Autosomal-Dominant Retinitis Pigmentosa30. Its role in the development of cancer had also been reported in previous studies. For example, it was over-expressed in lung adenocarcinomas according to The Cancer Genome Atlas (TCGA) projects31. In addition, PRPF6 was over-expressed in colon cancer32 to drive cancer proliferation by preferential splicing of genes associated with growth regulation33. Abnormal expression of PRPF6 alters the constitutive and alternative splicing of a discrete number of genes, including an oncogenic isoform of the ZAK kinase33 that activates several cancer (including lung cancer)-related signaling pathways, such as those of NF-kB, Wnt/b catenin, AP1, ERK and JNK34,35. It should be mentioned that the protective alleles of these six SNPs were associated with both a decreased lung cancer risk and a decreased mRNA expression of PRPF6, which is consistent with the previous findings of overexpression of the gene in colon and lung cancers. Therefore, based on all these, we hypothesized of abnormal expression of PRPF6 might contribute to development of lung cancer in the following way: the risk allele of PRPF6 was associated with elevated PRPF6 mRNA expression, then the oncogenic isoforms, such as the ZAK kinase, were generated as a result of abnormal splicing and activation in normal lung tissues, which finally resulted in lung carcinogenesis.
In summary, the present large-scale meta-analysis of eight published lung-cancer GWASs consisting of 14,463 lung cancer cases and 44,188 controls revealed a novel lung cancer susceptibility locus in the mRNA splicing-related gene PRPF6 and provided some new insight into genetic architecture and carcinogenic mechanisms of lung cancer. Further validation and functional evaluation of this genetic variant are warranted to verify our findings.
Materials and Methods
In the present study, we used the data from eight lung-cancer GWASs with a total of 14,463 lung cancer cases and 44,188 health controls. As shown in Table 1, the analysis included six GWAS datasets from the TRICL consortium (12,160 lung cancer cases and 16,838 controls of European ancestry) and another two datasets of lung cancer GWASs: the Caucasian origin-Harvard Lung Cancer Study (984 cases and 970 controls) and the Icelandic Lung Cancer Study (deCODE from the ILCCO) (1,319 cases and 26,380 controls). As described previously16, the TRICL six lung-cancer GWASs include the MD Anderson Cancer Center (MDACC) GWAS, the Institute of Cancer Research (ICR) GWAS, the National Cancer Institute (NCI) GWAS, the International Agency for Research on Cancer (IARC) GWAS, Lunenfeld-Tanenbaum Research Institute study (Toronto) GWAS and the German Lung Cancer Study (GLC) GWAS.
All of the subjects included in the analysis had provided a written informed consent. All methods were performed in accordance with the relevant guidelines and regulations, and all original studies were approved by the institutional review board from each of the participating institutions.
GWAS genotyping, imputation and quality controls
GWASs included in the current study were performed by the following platforms: Illumina HumanHap 317, 317 + 240 S, 370 Duo, 550, 610 or 1 M arrays. Genotype imputation was conducted based on linkage disequilibrium (LD) information from the 1000 Genomes Project (phase I integrated release 3, March 2012) by IMPUTE2, MaCH or minimac software16. Imputed SNPs with information score <0.40 by IMPUTE2 or r2 < 0.30 by MaCH were excluded from further analysis. In addition, standard quality control on samples by excluding individuals with low call rate (<90%) and extremely high or low heterozygosity (P < 1.0 × 10−4), as well as those estimated to be of non-European ancestry (using the HapMap phase II CEU, JPT/CHB and YRI populations as reference) were conducted among all of the studies.
Genes and SNPs selection
The mRNA splicing-related genes were selected with the combination of two online databases commonly exploited in the gene-set enrichment analysis: Molecular Signatures Database (http://www.broadinstitute.org/gsea/index.jsp) and Genecards (http://www.genecards.org/). Overall, 206 mRNA splicing-related genes were selected (Supplementary Table 1), which included a total of 11,966 genotyped or imputed common SNPs extracted from these genes (include 2-kb upstream and downstream of genes) among the TRICL Consortium with the following inclusion criteria: 1) Genotyping call rate ≥ 90%; 2) Minor allele frequency (MAF) ≥ 5% among Europeans; and 3) Hardy Weinberg Equilibrium exact P value ≥ 10−5. The detailed work-flow can be found in Supplementary Figure 1.
The associations between SNPs and lung cancer risk were evaluated using an additive genetic model by R (v2.6), Stata (v10, State College, Texas, US) and PLINK (v1.06) software36. For the studies of ICR, MDACC, IARC, Toronto, NCI and Harvard, top significant principle components were included in the analysis to control for population stratification that might cause inflation of test statistics, while for the deCODE study, genomic control method was used to adjusted for population stratification. No population structure was found in the German Lung Cancer Study (GLC). With the application of the inverse variance method, a meta-analysis under the fixed and random-effects models was performed on the results of a log-additive model for 11,966 SNPs. In order to assess the heterogeneity among studies, the Cochran’s Q statistic to test for heterogeneity and the I2 statistic to quantify the proportion of the total variation due to the heterogeneity were calculated31. A fixed-effects model was applied, if no heterogeneity existed among studies (PQ-test > 0.10 and I2 < 25%); otherwise, a random-effects model was chosen. The Benjamini–Hochberg false discovery rate (FDR) procedure was employed for the correction of multiple testing with a cutoff value ≤ 0.05.
Regional association plots were generated with LocusZoom on the basis of 1000 Genomes European (EUR) reference data (phase I integrated release 3, March 2012)37. Haploview v4.2 was employed to construct the Manhattan and LD plot. All analyses were conducted with SAS (version 9.1.3; SAS Institute, Cary, NC, USA) unless specified otherwise.
In silico SNP function annotation
To prioritize functional SNPs, three in silico tools: dbSNP func annotation (http://www.ncbi.nlm.nih.gov/projects/SNP/), SNPinfo (http://snpinfo.niehs.nih.gov/), and RegulomeDB (http://regulomedb.org/) were employed. Furthermore, the associations between SNPs and PRPF6 mRNA expression were performed using lymphoblastoid cell expression data from 1000 Genomes Project European populations (EUR, 373 individuals) (phase I integrated release 3, March 2012)38 by linear regression model.
How to cite this article: Pan, Y. et al. Associations between genetic variants in mRNA splicing-related genes and risk of lung cancer: a pathway-based analysis from published GWASs. Sci. Rep. 7, 44634; doi: 10.1038/srep44634 (2017).
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The authors thank the investigators and participants of all TRICL-ILCCO studies (i.e., ICR, MDACC, IARC, NCI, Toronto, GLC, Harvard and deCODE) for their important contributions. As Duke Cancer Institute members, QW and KO acknowledge support from the Duke Cancer Institute as part of the P30 Cancer Center Support Grant (Grant ID: NIH CA014236). QW was also supported by a start-up fund from Duke Cancer Institute, Duke University Medical Center. Yongchu Pan was sponsored by Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University. TRICL: This work was supported by the Transdisciplinary Research in Cancer of the Lung (TRICL) Study, U19-CA148127 on behalf of the Genetic Associations and Mechanisms in Oncology (GAME-ON) Network. The Toronto study was supported by Canadian Cancer Society Research Institute (020214), Ontario Institute of Cancer and Cancer Care Ontario Chair Award to RH The ICR study was supported by Cancer Research UK (C1298/A8780 andC1298/A8362—Bobby Moore Fund for Cancer Research UK) and NCRN, HEAL and Sanofi-Aventis. Additional funding was obtained from NIH grants (5R01CA055769, 5R01CA127219, 5R01CA133996, and 5R01CA121197). The Liverpool Lung Project (LLP) was supported by The Roy Castle Lung Cancer Foundation, UK. The ICR and LLP studies made use of genotyping data from the Wellcome Trust Case Control Consortium 2 (WTCCC2); a full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Sample collection for the Heidelberg lung cancer study was in part supported by a grant (70–2919) from the Deutsche Krebshilfe. The work was additionally supported by a Helmholtz-DAAD fellowship (A/07/97379 to MNT) and by the NIH (U19CA148127). The KORA Surveys were financed by the GSF, which is funded by the German Federal Ministry of Education, Science, Research and Technology and the State of Bavaria. The Lung Cancer in the Young study (LUCY) was funded in part by the National Genome Research Network (NGFN), the DFG (BI576/2-1; BI 576/2-2), the Helmholtzgemeinschaft (HGF) and the Federal office for Radiation Protection (BfS: STSch4454). Genotyping was performed in the Genome Analysis Center (GAC) of the Helmholtz Zentrum Muenchen. Support for the Central Europe, HUNT2/Tromsø and CARET genome-wide studies was provided by Institute National du Cancer, France. Support for the HUNT2/Tromsø genome-wide study was also provided by the European Community (Integrated Project DNA repair, LSHG-CT- 2005-512113), the Norwegian Cancer Association and the Functional Genomics Programme of Research Council of Norway. Support for the Central Europe study, Czech Republic, was also provided by the European Regional Development Fund and the State Budget of the Czech Republic (RECAMO, CZ.1.05/2.1.00/03.0101). Support for the CARET genome-wide study was also provided by grants from the US National Cancer Institute, NIH (R01 CA111703 and UO1 CA63673), and by funds from the Fred Hutchinson Cancer Research Center. Additional funding for study coordination, genotyping of replication studies and statistical analysis was provided by the US National Cancer Institute (R01 CA092039). The lung cancer GWAS from Estonia was partly supported by a FP7 grant (REGPOT245536), by the Estonian Government (SF0180142s08), by EU RDF in the frame of Centre of Excellence in Genomics and Estoinian Research Infrastructure’s Roadmap and by University of Tartu (SP1GVARENG). The work reported in this paper was partly undertaken during the tenure of a Postdoctoral Fellowship from the IARC (for MNT). The Environment and Genetics in Lung Cancer Etiology (EAGLE), the Alpha-Tocopherol, Beta-Carotene Cancer Prevention Study (ATBC), and the Prostate, Lung, Colon, Ovary Screening Trial (PLCO)studies and the genotyping of ATBC, the Cancer Prevention Study II Nutrition Cohort (CPS-II) and part of PLCO were supported by the Intramural Research Program of NIH, NCI, Division of Cancer Epidemiology and Genetics. ATBC was also supported by US Public Health Service contracts (N01-CN-45165, N01-RC-45035 and N01-RC-37004) from the NCI. PLCO was also supported by individual contracts from the NCI to the University of Colorado Denver (NO1-CN-25514), Georgetown University(NO1-CN-25522), Pacific Health Research Institute (NO1-CN-25515), Henry Ford Health System (NO1-CN-25512), University of Minnesota(NO1-CN-25513), Washington University(NO1-CN-25516), University of Pittsburgh (NO1-CN-25511), University of Utah (NO1-CN-25524), Marshfield Clinic Research Foundation (NO1-CN-25518), University of Alabama at Birmingham (NO1-CN-75022, Westat, Inc. NO1-CN-25476), University of California, Los Angeles (NO1-CN-25404). The Cancer Prevention Study II Nutrition Cohort was supported by the American Cancer Society. The NIH Genes, Environment and Health Initiative (GEI) partly funded DNA extraction and statistical analyses (HG-06-033-NCI-01 andRO1HL091172-01), genotyping at the Johns Hopkins University Center for Inherited Disease Research (U01HG004438 and NIH HHSN268200782096C) and study coordination at the GENEVA Coordination Center (U01 HG004446) for EAGLE and part of PLCO studies. Funding for the MD Anderson Cancer Study was provided by NIH grants (P50 CA70907, R01CA121197, R01CA127219, U19 CA148127, R01 CA55769, and K07CA160753) and CPRIT grant (RP100443). Genotyping services were provided by the Center for Inherited Disease Research (CIDR). CIDR is funded through a federal contract from the NIH to The Johns Hopkins University (HHSN268200782096C). The Harvard Lung Cancer Study was supported by the NIH (National Cancer Institute) grants CA092824, CA090578, and CA074386. deCODE: The project was funded in part by GENADDICT: LSHMCT-2004-005166), the National Institutes of Health (R01-DA017932).
About this article
Associations of genetic variants in endocytic trafficking of epidermal growth factor receptor super pathway with risk of nonsyndromic cleft lip with or without cleft palate
Molecular Genetics & Genomic Medicine (2018)