Introduction

Thyroid cancer (TC) is an endocrine tumor arising from the parafollicular cells (medullary thyroid cancer, MTC) or the follicular cells (differentiated thyroid cancer, DTC) of the thyroid gland. Although relatively rare, it is the most common endocrine tumor and shows a relatively high incidence in Italy, where an age-standardized rate (ASR) of 13.5/100,000 was reported (http://eco.iarc.fr/eucan).

The best-established risk factor for DTC initiation and progression is exposure to ionizing radiation1. Family history and inherited genetic variants also play an important role in the disease, as demonstrated by family linkage studies, candidate-gene association studies and genome-wide association studies (GWASs)2.

Risk loci at 2q35 (DIRC3), 9q22.33 (FOXE1) and 14q13.3 (NKX2-1) were identified in GWASs carried out on an Icelandic population and confirmed across different populations3,4,5,6,7,8,9,10. In 2013, we reported the results of a GWAS based on a high-incidence Italian population and the previous associations for 9q22.33 and 2q35 loci were confirmed in the Italian cohort and in the combined analysis of four different cohorts (Italian, Polish, Spanish and UK), respectively. Moreover, in the first replication study risk loci at 3q25.32 (RARRES1), 7q21 (IMMP2L) and 9q34.3 (SNAPC4/CARD9) were associated with DTC only among Italians. In the second study based on the Italian GWAS, a strong relationship of DTC predisposition was found with SNPs on 20q11.22-q12 (DHX35) and 14q24.3 (BATF) across different populations. In addition, 5q14 (ARSB) and 13q12.12 (SPATA13) were associated with the disease only among Italians. These results supported the idea of genetic heterogeneity between populations and suggested the hypothesis of Italian-specific DTC susceptibility alleles11,12.

In this report, 32 SNPs selected from the previous Italian GWAS were analyzed in a large cohort and the functional role of the best associated SNPs was investigated by using ENCODE project data and by expression quantitative trait loci (eQTL) analyses. Furthermore, results from the present analysis were combined with those obtained in the previous Italian studies and the cumulative effect of the GWAS-identified SNPs on DTC risk was evaluated.

Results

In the first phase of this study, 32 SNPs were tested for replication in a large Italian cohort consisting of 1,539 DTC cases and 1,719 controls (the study populations are described in Kohler et al. (2013)11 and in Supplementary Table S1). Results obtained in this phase and in the previous GWAS are reported in Supplementary Table S2. One marker (rs1358175) deviated from Hardy-Weinberg equilibrium in controls (p-value < 0.005) and was excluded from the analysis. A statistically significant association at p-value < 0.05, in the same direction as in the GWAS, was found for rs10864251, rs4908581, rs1400967, rs290219, rs7935113, rs4624074 and rs1203952. Additionally, an association with DTC in the same direction as in the GWAS was observed for rs11130536 and rs3863973, although it was not significant.

Combining the GWAS data with the present Italian results (2,260 cases and 2,218 controls), SNPs rs7935113 (OR = 1.36, 95% CI 1.20–1.53, p-value = 7.41 × 10⁻⁷) and rs1203952 (OR = 1.29, 95% CI 1.16–1.44, p-value = 4.42 × 10⁻⁶) approached genome-wide significance (Table 1). However, these strong associations were not confirmed in the replication studies on the Polish and the Spanish populations. Consequently, the combined analysis of all replication studies (Italian, Polish, Spanish) and the joint analysis of all cohorts (GWAS, Italian, Polish, Spanish) did not reach genome-wide significance, consistent with high heterogeneity between the study populations. None of the remaining loci were associated with the disease (Table 1).

Table 1 Risk of differentiated thyroid cancer in all cohorts

To characterize the associated regions further, imputation was performed over 200 kb intervals spanning the associated loci. At 11p15.3, in the intronic region of the GALNTL4 gene, the LD block including the index SNP rs7935113 defined the association. This block is located in a weakly transcribed region in the GM12878 cell line (Figure 1A). HaploReg v2 analysis revealed that the rs7935113 risk allele removes an MZF1 transcription factor binding site (TFBS) and creates an SRF TFBS. This variation also affects the chromatin structure and enhancer histone markers in different cell lines, as demonstrated by the ENCODE project data (Supplementary Table S3). eQTL analysis on lymphoblastoid cells demonstrated a significant association between rs7935113 and GALNTL4 expression level (Figure 1B; p-value = 0.03). However, this correlation was not found in thyroid tissues (data not shown). Several functional roles were also predicted for SNPs in LD with rs7935113, owing to their location in enhancer histone marker sites in different cell lines (e.g. GM12878, HSMM and HMEC) and to alterations of TFBSs (Supplementary Table S3).

Figure 1

Regional association plots and functional prediction of the strongest associated variants.

On the left (A and C) the regional plots are shown. In each plot, the −log10(p-value) (y-axis) of each SNP is shown according to its chromosomal position (x-axis). The SNP of interest is indicated by a violet circle. SNPs that were genotyped in the GWAS are marked by circles; imputed SNPs are marked by squares. The color of each SNP represents the strength of its linkage disequilibrium with the SNP of interest. The blue line indicates the local recombination rate (cM/Mb). At the bottom, the chromatin state segmentation profile (ChromHMM) in the lymphoblastoid cell line is reported. On the right (B and D), the correlation between gene expression levels and SNP genotypes is shown according to the data available at SNPexp (http://tinyurl.com/snpexp) on lymphoblastoid cell lines (B) and at the GTEx Portal (http://www.gtexportal.org/home/) on thyroid tissues (D).

At 20p11, a large region near FOXA2 shows low p-values for association with DTC risk, as indicated by the index SNP rs1203952, by GWAS-typed SNPs (e.g. rs910956, rs2424440 and rs1203930) and by imputed SNPs (e.g. rs4815130, rs747912, rs910957 and rs1203953). The ChromHMM profile in GM12878 cells did not show any functional role for this region (Figure 1C). However, rs1203952, located 52 kb upstream of FOXA2, is predicted to alter the binding of the Evi-1, Foxp1, Pou2f2 and SIX5 TFs and of the regulatory protein FOXA1, and to have an effect on the chromatin structure in MCF-7 cells. Similar roles were predicted for SNPs in LD with the index SNP rs1203952. Interestingly, according to the GTEx data, SNPs belonging to this LD block are associated with differential expression of the FOXA2 gene in thyroid tissues (p-value = 1.5 × 10⁻⁶) (Figure 1D and Supplementary Table S3).

We investigated the cumulative effect of 11 independent susceptibility SNPs in 10 genes (DIRC3, IMMP2L, RARRES1, SNAPC4/CARD9, BATF, DHX35, ARSB, SPATA13, GALNTL4, FOXA2) in the Italian population (1,791 cases and 1,588 controls; Supplementary Table S4). The risk allele distribution in both cases and controls followed a normal distribution, but with a shift toward a higher number of risk alleles among cases (Figure 2A). A dose-dependent increase in risk of DTC was observed with an increasing number of risk alleles (OR per allele = 1.30, 95% CI 1.26–1.35; p-trend = 3.13 × 10⁻⁴⁷). Compared to those with ≤7 risk alleles, individuals with ≥14 risk alleles had a 7.68 times higher risk of DTC (Supplementary Table S5, Figure 2B). As the well-established FOXE1 SNP rs965513 was genotyped only in the GWAS samples, we analyzed the effect of all 12 SNPs associated with DTC in the Italian population in these samples (642 cases and 416 controls). The same trend was observed (OR per allele = 1.52, 95% CI 1.42–1.63; p-trend = 1.71 × 10⁻³²), with individuals carrying ≥14 risk alleles having a 27.45 times higher risk of DTC than those with ≤7 risk alleles (Supplementary Table S6 and Supplementary Figure S1). This approximately 3.5 times higher risk estimate after including the FOXE1 SNP can be explained by the “winner's curse”, whereby the first reported association tends, by chance, to be stronger than the one obtained in replication. Finally, we estimated the proportion of variance in DTC risk (on the liability scale) explained by the identified susceptibility SNPs. We assumed the prevalence of DTC among Italians to be 0.01 and found that the 12 SNPs explained about 4% of the disease risk. The FOXE1 SNP alone explained less than 1% of the risk.

Figure 2

Cumulative risk assessment.

(A) Sample distribution according to the number of risk alleles in eleven SNPs associated in the Italian DTC cases (black columns) and controls (grey columns). (B) Plot of the increasing ORs for DTC with increasing number of risk alleles. The category ≤7 was chosen as reference (OR = 1.0); vertical bars correspond to 95% confidence intervals.

Discussion

The present study had two main goals: to identify novel SNPs predisposing to DTC and to assess the cumulative effect of the SNPs identified so far in the Italian population. We analyzed 32 SNPs from the loci that showed evidence of association with DTC in our recent GWAS. The associations between DTC and SNPs rs7935113 and rs1203952 approached genome-wide significance in the combined Italian populations. Risk of the disease increased markedly with an increasing number of risk alleles, which altogether explained about 4% of the genetic variance in DTC susceptibility.

The common variant rs7935113 is located in an intronic region of GALNTL4 at 11p15.3. This gene is a member of a large subfamily of the galactosaminyltransferases (GalNAc-Ts). Although little is known about its biological function, it is well established that abnormal O-glycan production contributes to the malignant phenotype and plays an important role in cell adhesion, invasion and metastasis13. Our eQTL analysis indicated that the rs7935113-C risk allele leads to an increased expression of GALNTL4. This is consistent with previous data showing that various genes encoding GalNAc-Ts are overexpressed in malignant tissues compared with normal tissue, such as GALNT3 in pancreatic cancer and GALNT6 in breast cancer14,15. Moreover, rs7935113 alters the binding sites of the SRF and MZF1 transcription factors. To date, various studies have shown that SRF is involved in important cellular processes such as expression of tissue-specific genes, cell proliferation, differentiation and apoptosis16,17,18,19. SRF has been found to be significantly up-regulated in PTC and anaplastic carcinoma as compared to non-tumor thyroid tissues. Moreover, in vitro assays have indicated that overexpression of SRF in thyroid cancer cells enhances the expression level of c-Fos protein, cell migration and invasiveness20. The role of MZF1 in thyroid cancer has not been investigated; however, it has been implicated in colorectal, cervical and bladder carcinogenesis21,22,23.

rs1203952 is located on 20p11, upstream of FOXA2 (also known as HNF3B), which belongs to the same forkhead-box (FOX) TF family as FOXE1, the strongest risk gene associated with DTC so far. Like FOXE1, FOXA2 is able to bind the TPO promoter and to modulate its transcriptional activity24. Thus, FOXE1 and FOXA2 could act together in regulating the synthesis of the thyroid hormones triiodothyronine (T3) and thyroxine (T4). The eQTL analysis demonstrated that homozygosity for the rs1203952-G risk allele is associated with a decreased level of FOXA2 expression in thyroid tissues. This result is consistent with a previously published study showing that thyroid cancer cells had decreased expression of FOXA2 when compared to normal cells and that forced re-expression of FOXA2 was associated with cell growth inhibition25. Interestingly, FOXA2 is a methylated gene in breast and lung cancer cells and its overexpression in a lung cancer cell line led to growth arrest and apoptosis26,27. The down-regulation of FOXA2 expression in thyroid tissues could be in part explained by the action of the regulatory protein FOXA1 and of the TFs Evi-1, Foxp1, Pou2f2 and SIX5, whose binding sites are altered by rs1203952 and for which a role in cell proliferation and differentiation has already been reported28,29,30,31,32. In particular, the mechanism of action of FOXA1 has also been described in anaplastic TC, where it regulates the expression of the cell cycle inhibitor p27kip1 and promotes cell proliferation33.

Given these observations, rs7935113 and rs1203952 could lead to DTC development and progression by altering GALNTL4 and FOXA2 expression through a different recruitment of TFs. Besides these mechanisms, many genetic variants in the LD block containing rs7935113 and rs1203952 could alter the binding of proteins and the function of regulatory elements (e.g. enhancers), suggesting that many other transcriptional regulatory mechanisms could explain the identified association signals. Fine-mapping studies of 11p15.3 and 20p11 loci are warranted to check whether other functional variants, including rare genetic variants, could explain the observed associations.

The second objective of this work was to determine the combined effect on DTC risk of the risk variants identified so far in the Italian population. We showed that, although individual susceptibility alleles have only a modest effect on the disease, the risk increases markedly when the alleles are combined. The additive effect of the SNP alleles on DTC risk was also recently reported for five GWAS-identified SNPs (rs965513, rs944289, rs966423, rs2439302 and rs116909374) in Polish and Ohio cohorts6. Moreover, our results suggested that the variation in DTC risk in the Italian population is explained in part by the 12 SNPs identified so far (about 4%). We note that this value most likely represents the lower bound for the contribution of these 12 loci, although this type of approach, which calculates heritability of disease liability, has been recommended for estimating the additive genetic contribution of common SNPs to complex disease susceptibility34. The low proportion of genetic variance explained by the identified risk alleles and the small proportion of the general population expected to be in the highest risk group (less than 5% of the Italian control population had ≥14 risk alleles) currently restrict the usefulness of genetic data in clinical practice. Taken together, these data are consistent with a multifactorial and polygenic model of DTC susceptibility. Future genetic research on larger sample sets and with novel technologies, such as array-based fine-mapping and next generation sequencing, is warranted to identify rare high-penetrance variants and gene-gene interactions that could explain a higher percentage of genetic and phenotypic variance in DTC.

In conclusion, our study provides further insights into inherited susceptibility to DTC among Italians and highlights the importance of genetic association studies.

Methods

Ethics statement

Study participants were recruited according to the protocols approved by the institutional review boards in accordance with the Declaration of Helsinki. All subjects provided written informed consent to participate in the study.

Study populations

This study was conducted on three sets of samples (Italian, Polish and Spanish), which were described elsewhere11 and are reported in Supplementary Table S1. All cases and controls were of Caucasian origin. Briefly, the Italian replication cohort included 1,539 DTC patients attending the University Hospital Cisanello in Pisa. The control group (1,719) was recruited from individuals without any history of thyroid disease or cancer: 1,079 were workers of the same hospital in Pisa and 640 were blood donors from the Meyer Hospital in Florence. The Polish group comprised 468 DTC patients and 470 healthy controls from the Department of Nuclear Medicine and Endocrine Oncology, Maria Sklodowska-Curie Memorial Cancer Center and Institute of Oncology in Gliwice. The Spanish cohort consisted of 446 DTC cases, recruited by the Department of Genetics and Microbiology of the Autonomous University of Barcelona, and 420 healthy individuals.

DNA isolated from peripheral blood leukocytes (Italian, Polish and Spanish cohorts) and oral mucosa cells (Spanish cohort) was used. DNA was extracted according to the protocols used in the respective institutions that provided the samples. DNA concentration was evaluated with a NanoDrop spectrophotometer. For Italian and Polish samples, whole genome amplification was performed using the Illustra GenomiPhi V2 DNA Amplification Kit (GE Healthcare) according to the manufacturer's protocol.

SNP selection and genotyping

Candidate SNPs were selected based on the results of the Italian GWAS reported by Köhler and coworkers, in which the best 250 SNPs had already been investigated11,12. Here, the next 250 SNPs were visually screened for the quality of their clustering pattern. The Manhattan plots (±100 kb from the SNP position) were also investigated and the SNPs were screened for other SNPs in linkage disequilibrium (LD) with the variant of interest. Finally, we selected 32 SNPs for further evaluation. All of them represented a region with at least two SNPs associated with DTC.

Genotyping was carried out using the TaqMan SNP genotyping assays (Life Technologies) according to the manufacturer's guidelines. To ensure genotyping reliability, repeated analysis was performed in a randomly selected 10% of samples (the average concordance rate was 99%). After excluding samples with more than 50% missing genotypes, all markers had a call rate greater than 95%, with a mean call rate of 98%. Original GWAS samples were re-genotyped for rs7935113 and rs1203952 and the results confirmed the GWAS data (concordance > 99%).

Statistical analysis

We tested the genotype distributions in controls for Hardy-Weinberg equilibrium by using the chi-square test. For each SNP, logistic regression analysis was performed to determine allelic odds ratios (ORs) with 95% confidence intervals (95% CIs) and allelic p-values. The calculations were done for each cohort separately for unadjusted models as well as with adjustment for sex and age/age at diagnosis. The results for adjusted models were similar to the unadjusted ones and are not reported. For Italian cohorts, further adjustments for the enrollment center (University Hospital Cisanello in Pisa and Meyer Hospital in Florence) or the place of birth (Southern Italy or Northern and Central Italy) did not substantially change the associations. For all replication studies combined, calculations were carried out correcting for age, sex and cohort. Cochran's Q statistic was calculated to test for heterogeneity and the I² statistic to quantify the proportion of the total variation due to heterogeneity. All these analyses were performed using SAS version 9.2 (SAS Institute Inc., Cary, NC, USA).
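
As an illustration of two of the tests named above, the following Python sketch computes a Pearson chi-square test for Hardy-Weinberg equilibrium from genotype counts, and Cochran's Q with I² from per-cohort log odds ratios. It is not the SAS code used in the study; the function names, the input layout and the inverse-variance weighting are assumptions made for the example.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square test (1 df) for Hardy-Weinberg equilibrium.

    Takes the counts of the three genotypes (AA, AB, BB) and returns
    (chi2, p_value). Assumes both alleles are observed at least once.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def cochran_q_i2(log_ors, std_errs):
    """Cochran's Q and I² (in percent) with inverse-variance weights.

    log_ors and std_errs hold one log odds ratio and its standard error
    per cohort (hypothetical inputs for this sketch).
    """
    weights = [1 / se ** 2 for se in std_errs]
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, log_ors))
    df = len(log_ors) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2
```

For example, genotype counts of 50, 0 and 50 give a chi-square of 100, far beyond the p-value < 0.005 threshold at which a marker was excluded in this study.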

Computational analysis

To evaluate the associated loci more thoroughly, we imputed genotypes of all SNPs located within 100 kb upstream or downstream of the most significant SNP that were not genotyped in the GWAS. We employed genotype information from the CEU panels of the publicly available HapMap3 (www.hapmap.org/) and 1000 Genomes Project (www.1000genomes.org/) databases. We used the software IMPUTE2 to perform imputation analysis on each associated locus. Regional plots were generated using LocusZoom (http://csg.sph.umich.edu/locuszoom/).

We searched for SNPs in high LD (r² ≥ 0.8) with the SNPs that showed the strongest associations with DTC predisposition, based on the CEU data of the 1000 Genomes Project pilot release (www.1000genomes.org/). To explore the epigenetic profile of the best associated regions, we checked the chromatin state segmentation profile (ChromHMM) in lymphoblastoid cells (GM12878) generated by the ENCODE project and available at the UCSC Genome Browser (http://genome.ucsc.edu/). To assess the possible functional role of each SNP, we used the ENCODE-based tool HaploReg v2 (www.broadinstitute.org/mammals/haploreg). In addition, we examined the eQTL data available for lymphoblastoid cells and thyroid tissues by using SNPexp (http://tinyurl.com/snpexp) and the GTEx Portal (http://www.gtexportal.org/home/), respectively.
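
The LD threshold used above can be made concrete with a small Python sketch. It estimates r² between two SNPs as the squared Pearson correlation of their allele counts (0, 1 or 2 per individual), a common approximation when phased haplotypes are not available. This is an assumption for illustration only: the LD values in the study came from the 1000 Genomes CEU data, and the function name and inputs here are hypothetical.

```python
def genotype_r2(dosages_a, dosages_b):
    """Squared Pearson correlation between two SNPs' allele counts.

    dosages_a and dosages_b list the allele count (0, 1 or 2) of each
    individual at SNP A and SNP B, in the same order of individuals.
    Returns 0.0 if either SNP is monomorphic in the sample.
    """
    n = len(dosages_a)
    mean_a = sum(dosages_a) / n
    mean_b = sum(dosages_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(dosages_a, dosages_b))
    var_a = sum((a - mean_a) ** 2 for a in dosages_a)
    var_b = sum((b - mean_b) ** 2 for b in dosages_b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov * cov / (var_a * var_b)


def in_high_ld(dosages_a, dosages_b, threshold=0.8):
    """True if the estimated r² meets the threshold (0.8 in the text)."""
    return genotype_r2(dosages_a, dosages_b) >= threshold
```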

Assessment of cumulative risk

We assessed the cumulative effect in the Italian population of the independent significant SNPs identified in our previous analyses and in the present study. For each SNP, the genotypes were coded as 0, 1 or 2, indicating the number of risk alleles in the genotype, and individuals were grouped into categories based on their total number of risk alleles (≤7, 8, 9, 10, 11, 12, 13 and ≥14). To avoid any bias due to missing data, samples with one or more missing genotypes were not included. ORs were calculated by comparing the groups defined by varying numbers of risk alleles to the group with the lowest number of risk alleles. This analysis was performed with Statgraphics Centurion software (StatPoint, USA) and the R program. Additionally, we used the Genome-wide Complex Trait Analysis (GCTA) program (http://www.complextraitgenomics.com/software/gcta/) to calculate the proportion of DTC variance (using a liability model) that is explained by the significant SNPs identified so far in the Italian population35. In a sample of independent individuals, this method uses a random effects mixed linear model to compare a matrix of pairwise genomic similarity with a matrix of pairwise phenotypic similarity.
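
The risk-allele counting and grouping described above can be sketched in Python as follows. This is a minimal illustration, not the Statgraphics or R analysis used in the study: the function names and the data layout (one list of per-SNP codes per individual, with None for a missing genotype) are hypothetical, and the 95% confidence interval uses the standard log-OR (Woolf) approximation as an assumption.

```python
import math
from collections import Counter

CATEGORIES = ["<=7", "8", "9", "10", "11", "12", "13", ">=14"]


def risk_category(genotypes):
    """Map one individual's per-SNP risk allele codes (0, 1 or 2) to a category.

    Returns None if any genotype is missing, so the sample is excluded.
    """
    if any(g is None for g in genotypes):
        return None
    total = sum(genotypes)
    if total <= 7:
        return "<=7"
    if total >= 14:
        return ">=14"
    return str(total)


def odds_ratios_vs_reference(case_genotypes, control_genotypes, reference="<=7"):
    """OR and 95% CI of each category against the reference category."""
    cases = Counter(risk_category(g) for g in case_genotypes)
    controls = Counter(risk_category(g) for g in control_genotypes)
    a0, b0 = cases[reference], controls[reference]
    results = {}
    for cat in CATEGORIES:
        if cat == reference:
            continue
        a, b = cases[cat], controls[cat]
        if min(a, b, a0, b0) == 0:
            continue  # OR is undefined when a cell of the 2x2 table is empty
        odds_ratio = (a * b0) / (b * a0)
        se = math.sqrt(1 / a + 1 / b + 1 / a0 + 1 / b0)  # SE of log(OR)
        lower = math.exp(math.log(odds_ratio) - 1.96 * se)
        upper = math.exp(math.log(odds_ratio) + 1.96 * se)
        results[cat] = (odds_ratio, lower, upper)
    return results
```

Returning None for incomplete samples mirrors the exclusion of individuals with one or more missing genotypes, so every category count is based on fully genotyped individuals.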