A distinct class of pan-cancer susceptibility genes revealed by an alternative polyadenylation transcriptome-wide association study

Alternative polyadenylation plays an important role in cancer initiation and progression; however, current transcriptome-wide association studies mostly ignore alternative polyadenylation when identifying putative cancer susceptibility genes. Here, we perform a pan-cancer 3′ untranslated region alternative polyadenylation transcriptome-wide association analysis by integrating 55 well-powered (n > 50,000) genome-wide association studies datasets across 22 major cancer types with alternative polyadenylation quantification from 23,955 RNA sequencing samples across 7,574 individuals. We find that genetic variants associated with alternative polyadenylation are co-localized with 28.57% of cancer loci and contribute a significant portion of cancer heritability. We further identify 642 significant cancer susceptibility genes predicted to modulate cancer risk via alternative polyadenylation, 62.46% of which have been overlooked by traditional expression- and splicing- studies. As proof of principle validation, we show that alternative alleles facilitate 3′ untranslated region lengthening of CRLS1 gene leading to increased protein abundance and promoted proliferation of breast cancer cells. Together, our study highlights the significant role of alternative polyadenylation in discovering new cancer susceptibility genes and provides a strong foundational framework for enhancing our understanding of the etiology underlying human cancers.

for each cancer trait.For the P-Z plot, the X-axis is the Estimated -log10(P-value) and the y-axis is the Raw -log10(P-value).Examples of genes with lead cancer risk variants were located in 3′UTR regions.c.
Comparison of the standardized effect size of lead SNPs within 3′UTR, relative to other regions that did not show significant differences.d.Comparison of the standardized effect size of lead SNPs within 3′UTR and its downstream relative to other regions.
The two-sided P-values were obtained using a Wilcoxon sum-rank test.

Figure S2 .
Figure S2.Quantile-Quantile (QQ)-plot of GWAS summary data.QQ-plot shows the quality control for each cancer trait, and generally plots the observed -log10P-value (y-axis) versus the quantile distribution of excepted -log10 P-value (x-axis).

Figure
Figure S3.P-Z plot of GWAS summary quality.P-Z -plot shows the quality control

Figure S4 .
Figure S4.Characterization of cancer GWAS loci.a. Histogram of minor allele frequencies (MAFs) for unique leading SNPs compared with the whole genome.b.

Figure S5 .
Figure S5.Cross cancer genetic correlation analysis.Heatmaps display the absolute value of the genetic correlation (|  |) between different cancer types within (a) European and (b) East-Asian population.The color in the heatmap represents the correlation between the compared groups, with darker colors indicating stronger correlations.

Figure S6 .
Figure S6.Genetic correlation between different groups.Comparisons of the genetic correlations a. within cancers vs. between cancers and b. in Europeans vs. Asians.P-values were calculated from the Wilcoxon test (two-sided).**, P < 0.01; ***P < 0.001; ****, P<0.0001; ns, not significant.The center lines within the box plot signify the median values, while the boxes encompass the interquartile range (IQR) from the 25th to the 75th percentile and the outliers are shown as separate dots.

Figure S7 .
Figure S7.Overview of TCGA xQTL mapping.In this study, we utilized a linear regression framework implemented in Matrix eQTL to assess the association between normalized PDUI values and SNPs within a 1 Mb interval from the 3′UTR region.We adjusted for known covariates, including sex, RIN (RNA Integrity Number), platform, and the top-five genotype principal components.Additionally, we accounted for unobserved covariates using PEER, with the number of PEER covariates determined based on the GTEx Consortium's guidelines.To obtain reliable statistical significance, we conducted 1,000 rounds of permutation and derived empirical P-values for each gene.Finally, we applied the R package qvalue to adjust the empirical P-values for multiple testing.

Figure S9 .
Figure S9.Enrichment of the cis-3′aQTL around the PAS.Distance from the PAS to the corresponding SNP in a range of ± 1Mbp.0 represents PAS.

Figure S11 .
Figure S11.Sequence structure comparisons between colocalized 3′aGenes, eGenes and sGenes.P values were determined through a two-sided t-test.The center horizontal lines of the box plot show the median values and the boxes span from the 25th to the 75th percentile.n.s., not significant.The center lines within the box plot signify the median values, while the boxes encompass the interquartile range (IQR) from the 25th to the 75th percentile and the outliers are shown as separate dots.

Figure S12 .
Figure S12.Colocalized 3′aGenes have more AU-rich elements proximal to poly(A) sites compared with eGenes and sGenes.a. Boxplot showing the number of overall AU-rich elements proximal to poly(A) sites (ranged from -100 to 100 bp from the poly (A) site) in each gene, comparing 3′aGenes with eGenes and sGenes.Centre horizontal lines of Boxplot represent median values, boxes span from the 25th percentile to the 75th percentile.Whiskers extend to 1.5 × IQR, where IQR is the interquartile range.b.Boxplots for each specific motif variant proximal to poly(A) sites in 3′aGenes, eGenes, and sGenes.

Figure
Figure S13.3′aTWAS models across 49 tissues from GTEx cohorts and 18 tumor tissues from TCGA studies.a. Number of 3'aTWAS models across each tissue from GTEx and TCGA cohorts.b.The number of predictive models is highly correlated with the sample size in GTEx tissues.c.The number of predictive models is highly correlated with the sample size in TCGA studies.Each dot represents a tissue type.

Figure S14 .
Figure S14.Tissue-specific predictive models for APA usages.a. Cross-validation prediction accuracy of cis-regulated APA usage (R 2 ) for all 64,513 models.b.Prediction accuracy (  2 ⅈ ⁄ − ℎ  2 ) for all 64,513 models, calculated using crossvalidation R2 between predicted and true PDUI and normalizing by corresponding cisℎ  2 .c. Histogram of the number of reference panels per transcript.The x-axis represents the number of overlapping reference panels for each model, and the y-axis represents the number of 3′aTWAS models.d.Boxplot of prediction accuracy for 3′aTWAS models.The median values are represented by the center horizontal line, and the boxes indicate the 25th and 75th percentiles.The median value of the prediction accuracy is 0.69.

Figure S15 .
Figure S15.APA Transcriptome-wide association study results based on potentially relevant tissues.a-c.Manhattan plot of 3′aTWAS nominating the APAlinked susceptibility genes in pan-cancer (a), breast cancer (b), and prostate cancer (c) respectively.Colored points represent significant 3 ′ aTWAS associations at FDR<0.05.d.Bar plots showing the number of significant genes detected by 3′aTWAS, with a false-discovery rate (FDR) < 0.05, for different cancers in the most relevant tissues.Venn plot shows the intersection of significant cancer genes identified by 3′ aTWAS (FDR <0.05) with genes identified by expression and splicing TWAS.e. Heatmap showing the 3′aTWAS genes shared across different cancer types.The color represents the P-values for 3 ′ aTWAS results.f-g.Effect of cancer-susceptibilityassociated APA-linked genes on cell proliferation of (e) breast cancer (n = 45) and (f) prostate cancer-related (n=8) cell lines for potentially relevant tissue.MYC is a known

Figure S16 .
Figure S16.The correlations between sample sizes, heritability, and the number of identified APA-linked genes across cancer types.a.The GWAS sample sizeshows no significant correlation with the number of identified significant 3'aTWAS genes (FDR<0.05).b.There is a strong correlation between estimated heritability and the number of significant 3'aTWAS genes (FDR<0.05).The coefficient (R) and P-value were calculated from the Pearson correlation.

Figure S17 .
Figure S17.Functional analyses show evidence of essentiality for 3′aTWAS genes in potentially relevant tissues.Gene knockout experiments to determine gene essentiality for different cancer types (a-l) in their relevant cell lines.The CERES score indicated a depletion of gene-targeting guide RNAs (corrected by copy number).MYC is a known essential gene and was shown as a positive control.Genes with a CERES score ≤ -0.5 (which is deemed as evidence of essentiality) are with purple in the boxplot.The cutoff of CERES value = -0.5 was shown in a red dashed line.For each gene, we tested its significance on cell proliferation based on the count of negative CERES values in the total number of respective cell lines using the Binomial test.The sample size is shown on the Y-axis of the boxplot.

Figure S18 .
Figure S18.Comparisons of the mean CERES score among significant 3'aTWAS genes versus expression TWAS and splicing TWAS.The P-value was calculated from the Wilcoxon's test (two-sided), n=720.

Figure S19 .
Figure S19.Significant APA-linked susceptibility gene CRLS1 in TCGA BRCA data.a. Manhattan plot of TWAS results for TCGA-BRCA.b. boxplot of rs2235816 for normalized PDUI values in TCGA-BRCA datasets (n = 1,094).The center lines within the box plot signify the median values, while the boxes encompass the interquartile range (IQR) from the 25th to the 75th percentile and the outliers are shown as separate dots.c. 3′UTR coverage plot of different alleles of rs2235816 for randomly selected samples in TCGA-BRCA datasets.

Figure S21 .
Figure S21.Validation of knockdown efficiency for the indicated siRNAs and shRNAs.Quantitative reverse transcription (rt)-PCR measuring CRLS1 gene expression of (a) the indicated siRNAs and (b) shRNAs in MCF-7 cells (n=3 independent samples).

Figure S22 .
Figure S22.Experimental validation of other breast cancer APA-linked susceptibility genes.a-c.Example of RNA-seq coverage plot for the 3′UTR of indicated genes.d-f.Luciferase activity from a reporter system containing the short and long 3′UTR of indicated genes in MCF-7 cells.Significance was determined by the unpaired Student's t-test (two-tailed), comparing the experimental group with the control.n=3 independent samples.g-i.Quantitative reverse transcription (qRT)-PCR measuring indicated gene expression upon shRNA knockdown in MCF-7 cells.n=3 independent samples.j-l.Cell proliferation of shRNA-mediated knockdown cells was analyzed on 1, 3, 5, and 7 days.For panels d-f, the bar graphs show the mean ± standard deviation values from three independent experiments.