Golgi apparatus, endoplasmic reticulum and mitochondrial function implicated in Alzheimer’s disease through polygenic risk and RNA sequencing

Polygenic risk scores (PRS) have been widely adopted as a tool for measuring common variant liability and they have been shown to predict lifetime risk of Alzheimer’s disease (AD) development. However, the relationship between PRS and AD pathogenesis is largely unknown. To this end, we performed a differential gene-expression and associated disrupted biological pathway analyses of AD PRS vs. case/controls in human brain-derived cohort sample (cerebellum/temporal cortex; MayoRNAseq). The results highlighted already implicated mechanisms: immune and stress response, lipids, fatty acids and cholesterol metabolisms, endosome and cellular/neuronal death, being disrupted biological pathways in both case/controls and PRS, as well as previously less well characterised processes such as cellular structures, mitochondrial respiration and secretion. Despite heterogeneity in terms of differentially expressed genes in case/controls vs. PRS, there was a consensus of commonly disrupted biological mechanisms. Glia and microglia-related terms were also significantly disrupted, albeit not being the top disrupted Gene Ontology terms. GWAS implicated genes were significantly and in their majority, up-regulated in response to different PRS among the temporal cortex samples, suggesting potential common regulatory mechanisms. Tissue specificity in terms of disrupted biological pathways in temporal cortex vs. cerebellum was observed in relation to PRS, but limited tissue specificity when the datasets were analysed as case/controls. The largely common biological mechanisms between a case/control classification and in association with PRS suggests that PRS stratification can be used for studies where suitable case/control samples are not available or the selection of individuals with high and low PRS in clinical trials.


Supplementary
X-axis represents ranks of GO-terms derived by Catmap and Y-axis represent GO-terms derived by topGO (classic algorithm with ks statistic or Kolmogorov-Smirnov test). p-values and r 2 were derived using a linear model. The most significant GO term (p-value) will have a rank of 1.  X and Y axes represent CMD dimension 1 and 2. GO term p≤0.05 FDR. Green dots represent significant GO terms from the PRS analysis of cerebellum, Blue dots represent significant GO terms from the PRS analysis of temporal cortex, Red dots represent significant GO terms overlapping in PRS analysis of cerebellum and temporal cortex. Cluster labels were manually curated based on the most common GO term in the cluster.

Supplementary Figure 12. GO term semantic similarity clustering, PRS cerebellum and PRS temporal cortex (gene order most down-regulated at top; GO down-regulated) a) Biological Process (BP) b) Cellular Component (CC)
X and Y axes represent CMD dimension 1 and 2. GO term p≤0.05 FDR. Green dots represent significant GO terms from the PRS analysis of cerebellum, Blue dots represent significant GO terms from the PRS analysis of temporal cortex, Red dots represent significant GO terms overlapping in PRS analysis of cerebellum and temporal cortex. Cluster labels were manually curated based on the most common GO term in the cluster. Top GO (overlapping)

Supplementary Figure 19. GO term semantic similarity clustering, cerebellum case/control and PRS (gene order based on p-values only; GO no direction) a) Biological Process (BP) b) Cellular Component (CC)
X and Y axes represent CMD dimension 1 and 2. GO term p≤0.05 FDR. Green dots represent significant GO terms from the PRS analysis of cerebellum, Blue dots represent significant GO terms from the PRS analysis of temporal cortex, Red dots represent significant GO terms overlapping in PRS analysis of cerebellum and temporal cortex. Cluster labels were manually curated based on the most common GO term in the cluster.

Supplementary Figure 20. GO term semantic similarity clustering, cerebellum case/control and PRS (gene order most up-regulated at top; GO up-regulated) a) Biological Process (BP) b) Cellular Component (CC)
X and Y axes represent CMD dimension 1 and 2. GO term p≤0.05 FDR. Green dots represent significant GO terms from the PRS analysis of cerebellum, Blue dots represent significant GO terms from the PRS analysis of temporal cortex, Red dots represent significant GO terms overlapping in PRS analysis of cerebellum and temporal cortex. Cluster labels were manually curated based on the most common GO term in the cluster.

Supplementary Figure 21. GO term semantic similarity clustering, cerebellum case/control and PRS (gene order most down-regulated at top; GO down-regulated) a) Biological Process (BP) b) Cellular Component (CC)
X and Y axes represent CMD dimension 1 and 2. GO term p≤0.05 FDR. Green dots represent significant GO terms from the PRS analysis of cerebellum, Blue dots represent significant GO terms from the PRS analysis of temporal cortex, Red dots represent significant GO terms overlapping in PRS analysis of cerebellum and temporal cortex. Cluster labels were manually curated based on the most common GO term in the cluster.

Supplementary Figure 22. GO term semantic similarity clustering, temporal cortex case/control and PRS (gene order based on p-values only; GO no direction) a) Biological Process (BP) b) Cellular Component (CC)
X and Y axes represent CMD dimension 1 and 2. GO term p≤0.05 FDR. Green dots represent significant GO terms from the PRS analysis of cerebellum, Blue dots represent significant GO terms from the PRS analysis of temporal cortex, Red dots represent significant GO terms overlapping in PRS analysis of cerebellum and temporal cortex. Cluster labels were manually curated based on the most common GO term in the cluster. X and Y axes represent CMD dimension 1 and 2. GO term p≤0.05 FDR. Green dots represent significant GO terms from the PRS analysis of cerebellum, Blue dots represent significant GO terms from the PRS analysis of temporal cortex, Red dots represent significant GO terms overlapping in PRS analysis of cerebellum and temporal cortex. Cluster labels were manually curated based on the most common GO term in the cluster.

Supplementary Figure 24. GO term semantic similarity clustering, temporal cortex case/control and PRS (gene order most down-regulated at top; GO down-regulated) a) Biological Process (BP) b) Cellular Component (CC)
X and Y axes represent CMD dimension 1 and 2. GO term p≤0.05 FDR. Green dots represent significant GO terms from the PRS analysis of cerebellum, Blue dots represent significant GO terms from the PRS analysis of temporal cortex, Red dots represent significant GO terms overlapping in PRS analysis of cerebellum and temporal cortex. Cluster labels were manually curated based on the most common GO term in the cluster.

Sample Data
We used the MayoRNAseq study (MayoRNAseq) [1], part of the Accelerated-Medicine Partnership (AMP-AD). MayoRNAseq is a post-mortem brain cohort sample of individuals with a neuropathological diagnosis of AD, progressive supranuclear palsy, or pathological ageing, and elderly controls. The MayoRNAseq human brain-derived samples comprise temporal cortex and cerebellum tissues. The number of samples and other descriptors can be found in Supplementary Table 1.

RNA-seq quality control (QC)
RNA-seq QC of the aligned GRCh38.38 bam files was performed using RNA-SeQC 2.3.5 [4]. Individual RNA-SeQC measures were converted to percentage and means and standard deviations were calculated for these measures separately within the two tissue MayoRNAseq samples, cerebellum and temporal cortex. Samples were excluded from further analysis if a specific RNA-SeQC measure for an individual RNA-seq sample was 4 standard deviations away from the mean of the distribution (³ or £ 4 standard deviations considered for different measures). More information on the measures used is provided in the supplementary materials (Supplementary Table 2).
Read counts per gene per sample were derived using htseq-count (v0.11.2). Samples were removed from further analysis if they had 0 read counts across all genes. Genes were removed from further analysis if they had 0 read counts across all samples. Read counts were normalised using the trimmed mean of M-values (TMM) normalization method in the R/bioconductor package edgeR [5] (v3.34.0) to estimate scaling factors and to adjust for differences in library sizes. Genes were excluded from further analysis if TMM values were <0.5 in 50% of the MayoRNAseq samples.
Raw gene counts derived from htseq-count for the remaining samples were normalized using Conditional Quantile Normalization (CQN) [6] (R/bioconductor package cqn v1.38.0) to use for principal component analysis (CQN-normalised counts; y+offset). Gene exon lengths and GC content were calculated from the gtf and fasta files using custom built programs.

VCF QC
Individual vcf files were converted and merged with PLINK [7] (PLINK v2.00a2.3) and bi-allelic variants were kept. For ethnicity estimates we also downloaded phase3 1000 Genomes Project reference data [8] from PLINK's website (https://www.coggenomics.org/plink/2.0/resources#1kg_phase3) and converted to PLINK format as well as removed duplicates with respect to genomic position. Only variants that were present in the 1000 Genomes phase3 were kept for further analysis (variants matched by chromosome and position). The chromosome and position of the remaining variants were converted with respect to the human genome reference GRCh38 using the UCSC web-based liftover tool (http://genome.ucsc.edu/cgi-bin/hgLiftOver). Variants with Hardy-Weinberg equilibrium (p£1x10-06), missingness (³0.05) and minor allele frequency (£0.01) were excluded from further analysis. The remaining filtered variants were combined with the 1000 Genomes phase3. Ancestry was estimated using Principal Component Analysis (PCA) in PLINK2 (-pca --maf) by plotting the first two eigenvectors and samples were excluded from further analysis if a sample deviated from the 1000 Genomes EUR cluster (Supplementary Figure  1c). In order to estimate individual sex, we removed the pseudo-autosomal region of the X chromosome and calculated inbreeding coefficients (F) using PLINK (--check-sex). Individuals with F£0.2 were deemed females and F³0.8 males.
Genetic relationship between the samples (pairwise identity by descent (IBD)) was determined using PLINK (--genome full --min 0.1). Samples that had PI-HAT³0.22 were considered duplicates, or first-degree relatives (including identical twins). All such pairs of samples were excluded from further analysis.

Matching RNA-seq to VCF samples
Matching individual RNA-seq with the VCF samples was done using verifyBamID [9] (v1.1.3). A matched sample comprising RNA-seq and WGS vcf files from the same individual was done based on the IBD coefficients from verifyBamID; a matched sample was deemed IBD³0.8. Only matched RNA-seq with WGS vcf files were included for further analysis.

AD diagnosis and APOE status
The MayoRNA-seq dataset included diagnostic status in the RNA-seq covariates file from the AMP-AD knowledge portal. We excluded samples with progressive supranuclear palsy and pathological ageing, retaining only data from samples with a label of AD or control. To assign APOE status we used rs429358 and rs7412. The ambiguous double heterozygotes were coded as E2/E4 (Supplementary Table 3).

RNA-seq differential gene-expression
We used R/bioconductor package DESeq2 (v.1.32.0) [10] with raw htseq-counts with age at death, sex and APOE status as covariates (DESeq2 model matrix: design=~age_at_death+sex+APOE_status+diagnosis for case/control analysis and design=~age_at_death+sex+APOE_status+PRS for PRS; log fold changes and p-values are returned for the last variable in the design matrix). To account for multiple hypotheses testing the Benjamini-Hochberg false discovery rate was used (FDR).

Gene Ontology
The Wilcoxon rank sum test, as implemented in Catmap [11], was used to test for significant enrichment of Gene Ontology (GO) categories (differential gene-expression associated with PRS and separately for differential gene-expression with case/controls) using custom built gene-GO gene association (Supplementary Materials and Methods). Ranks of genes were based on the p-value from the significance of the differential gene-expression (from DESeq2). For all tests, three lists were derived comprising (1) differentially expressed genes based on p-value only (termed no-direction), (2) the most differentially up-regulated (p-value and log-fold) genes at the top of the list and most differentially down-regulated genes (log-fold< 0) at the bottom of the list (termed up-regulated) and (3) the most differentially down-regulated (log-fold< 0) genes at the top of the list and most differentially up-regulated genes (log-fold> 0) at the bottom of the list (termed down-regulated). Gene lists 2) and 3) are inverted copies of each other. We used random null as the null hypothesis. Even though a random null is not as good approximation as compared to sample label permutations [11], it was deemed computationally unfeasible to perform sample-label permutations. Nevertheless, we also performed a functional GO enrichment analysis using a separate method (topGO [12]; v2.44.0) with the three sets of ranked list of genes using the classic algorithm with the ks statistic (Kolmogorov-Smirnov test) and compared the results with Catmap ( Supplementary Figs. 7 and 13). To account for multiple hypotheses testing the Benjamini-Hochberg false discovery rate was used. Statistical significance of overlaps of GOs between two experiments (e.g. Catmap vs. topGO, significant GO terms associated with PRS vs. significant GO terms from a case/control analysis) was determined using a hypergeometric test, including Biological Process (BP), Cellular Component (CC) and Molecular Function (MF) GO terms (genes Supplementary Figs. 2b-f, 8b-f, 14b-f, 15a-e, 15b-f and GO terms main text Figs. 2b, 3b&d, 4b&d and Supplementary Figs. 3a-e, 9a-e, 17a-e, 18a-e) and profile similarity by using a paired rank-based test for association based on Spearman's rho ( Supplementary Figs. 2a, 3f-h, 8a, 9f-h, 14a, 16a, 17f-h, 18f-h).
For clustering of statistically significant GO terms we used semantic similarity (GOSemSim [13]) with Rel information content measure and classical multidimensional scaling (CMD; cmdscale package in R, k=2) separately for BP and CC GO terms as semantic similarity can only be performed within BP or CC. The most representative (manually curated) GO term was chosen as the name for describing CMD clusters. This process was used for Figures 2a, 3a&c, 4a&c in the main text as well as Supplementary Figs. 4-6, 10-12 and 19-24.

PRS calculations
To generate polygenic risk scores for the MayoRNAseq dataset we used the summary statistics from the clinically assessed case/control study on AD [14], excluding AMP-AD samples that are part of that GWAS [14]. We chose to use Kunkle et al. [14] AD GWAS, as it does not include the UK Biobank summary statistics, where cases are defined via family history (AD proxies [15]) and controls are not screened for AD. PRS were calculated using PLINK for pT£0.1 (p-value threshold) on LD-clumped SNPs by retaining the SNP with the smallest p-value excluding SNPs with r 2 >0.1 in a 1000kb window. We chose to use the simplest clumping and thresholding approach as the simplicity guarantees the transparency (included SNPs and weights are known) and the prediction accuracy of AD by the PRS is similar for the majority of methodologies as previously described [16]. All derived scores were adjusted for 5 consecutive principal components then standardised within the MayoRNAseq samples (mean and standard deviation).