Reference SVA insertion polymorphisms are associated with Parkinson’s Disease progression and differential gene expression

The development of Parkinson’s disease (PD) involves a complex interaction of genetic and environmental factors. Genome-wide association studies using extensive single nucleotide polymorphism datasets have identified many loci involved in disease. However much of the heritability of Parkinson’s disease is still to be identified and the functional elements associated with the risk to be determined and understood. To investigate the component of PD that may involve complex genetic variants we characterised the hominid specific retrotransposon SINE-VNTR-Alus (SVAs) in the Parkinson’s Progression Markers Initiative cohort utilising whole genome sequencing. We identified 81 reference SVAs polymorphic for their presence/absence, seven of which were associated with the progression of the disease and with differential gene expression in whole blood RNA sequencing data. This study highlights the importance of addressing SVA variants and potentially other types of retrotransposons in PD genetics, furthermore, these SVA elements should be considered as regulatory domains that could play a role in disease progression.


INTRODUCTION
Parkinson's disease (PD) is the second most common neurodegenerative disease affecting 1% of the population over 60 years of age 1 . PD pathology is characterised by the loss of dopaminergic neurons from the substantia nigra and the formation of neuronal inclusions called Lewy bodies. The primary symptoms are related to motor dysfunction (bradykinesia, resting tremor, rigidity, and postural instability), however, individuals with PD can also develop a range of non-motor symptoms that include sleep disturbances, constipation and impaired olfaction 2 . The precise mechanisms underlying the development of PD are unknown. Genetic studies have played a key role in highlighting pathways involved in disease pathogenesis that include autophagy, mitophagy, and the innate immune system 3,4 . There are highly penetrant mutations in several genes that cause rare monogenic forms of PD in either an autosomal dominant (e.g., SNCA and LRRK2) or an autosomal recessive (e.g., PRKN and PARK7) manner 5 . However, monogenic forms of the disease contribute a small percentage to the total incidence of the disease, with the majority of cases classed as sporadic and likely caused by a complex interaction of genetic and environmental risk factors. Large-scale genome wide association studies (GWAS) have been crucial in identifying genetic risk factors of idiopathic disease, however, a large proportion of the disease heritability has yet to be identified 6,7 . In addition, many of the single nucleotide polymorphisms (SNPs) that are associated with disease development are in non-coding regions of the genome and their functional impact is often unknown. We aim to investigate the genetic component of PD further by studying other types of variation and their functional consequences and this study focuses on variation generated by a specific family of retrotransposons.
SVAs are composite elements (domains outlined in Fig. 1a) consisting of seven subfamilies (A-F and F1) and are the evolutionarily youngest member of the non-long terminal repeat class of retrotransposons 9,10 . Members of the SVA family are still able to retrotranspose in the human genome 11 , and this ongoing mobilisation has led to insertions that are polymorphic in the human population for their presence or absence. Of those SVAs in the reference genome 77 have been reported in the dbRIP database as polymorphic in terms of their presence or absence 12,13 . There are also SVA insertions that are not in the reference genome and fourteen such polymorphic SVA insertions have been identified as the cause of diseases including X-linked dystonia parkinsonism (XDP), Neurofibromatosis type 1 and haemophilia B through mechanisms such as loss of function mutation, modulation of splicing and deletions at the site of insertion [14][15][16] . XDP is associated with a founder haplotype that contains a SVA in intron 32 of the TAF1 gene, reduced expression of TAF1 and alternative splicing and intron retention directed by the SVA insertion 17 . Polymorphic retrotransposons, including members of the SVA family, have been shown to regulate gene expression in a population specific manner and could have an impact on phenotypic differences 18 . In the literature retrotransposon insertion polymorphisms (RIPs) have also been identified to be in linkage disequilibrium (LD) with SNPs associated with a range of diseases through GWAS, further analysis of selected candidates identified six RIPs that could lead to disease through changes in gene regulation 19 .
Our study focused on the potential role of reference SVA RIPs established in the population in the risk and progression of PD. To address this we utilised whole genome sequence, transcriptomic and clinical data from the Parkinson's Progression Markers Initiative (PPMI) which was established to identify PD progression markers to help understand disease aetiology and ultimately aid in the development of novel therapeutics 20 . The PPMI is a longitudinal study collecting clinical and imaging data and biospecimens from PD subjects, healthy controls and those subjects classified as SWEDD (scans without evidence of dopaminergic deficit) 21 . Using the extensive dataset from the PPMI cohort, we identified 81 SVAs that were polymorphic for their presence or absence in the reference genome. Seven of the identified SVA RIPs were associated with PD progression and subsequent analysis of these seven RIPs correlated specific genotypes with changes in gene expression in the blood.

RESULTS
Characterisation of reference SVA RIPs in the PPMI cohort The PPMI cohort, whose whole genome sequence data were analysed in this study, consisted of 179 healthy controls, 371 PD and 58 subjects classified as SWEDD. The subjects from each group showed similar demographics in terms of age and sex (Table 1) and all subjects included reported race as white. In the 608 subjects analysed using the structural variant caller Delly (https://github.com/dellytools/delly), 83 SVAs in the human reference genome were identified as absent in at least one genome compared to the reference (hg38) (Supplementary Table 1). Two of these were detected as absent in all genomes analysed and therefore were not classified as polymorphic in this cohort. The same individuals were also analysed using mobile element locator tool deletion (MELT-del) 22 which can detect polymorphic reference retrotransposons. MELT-del only detected 50 reference SVAs as absent compared to the 83 by Delly. DNA samples from the PPMI cohort were obtained for validation purposes of the variants detected. Three SVAs were amplified using PCR, two of which (SVA_27 and SVA_32) were detected by both Delly and MELT-del and the third (SVA_24) was only detected by Delly (see Supplementary Fig. 1 for representative gel images). The genotypes of the three SVAs called by Delly were in 98.4-99.4% agreement with the PCR genotypes compared to 91.5-95.6% for those called by MELT-del (Supplementary Table 2). Genotypes called by the tool Delly demonstrated greater accuracy and a higher number of SVA RIPs were detected compared to MELT-del, therefore Delly was chosen as the most suitable tool for this analysis.
Of the 81 SVAs RIPs detected by Delly, 28 (34.6%) had not been reported previously in the dbRIP track available on UCSC genome browser. The majority, 85.2%, of the SVA RIPs belonged to subtypes E, F and F1 (Fig. 1b), which was consistent with these human specific subfamilies being the most recently active 9 . The SVA RIPs showed a wide range of insertion allele frequencies from 0.0008-0.994 (Fig. 1c). Using logistic regression and Bonferroni correction for multiple testing there were no SVA RIPs significantly associated with PD when comparing control and PD subjects in this cohort (Supplementary Table 1). Tagging SNPs (r 2 > 0.8) were identified for 76 (93.8%) of the SVA RIPs and were used to act as a proxy for the SVA genotype, enabling the evaluation of these elements in the largest PD and healthy control dataset available from the recent GWA meta-analysis performed by Nalls et al 6 . In the summary statistics from this meta-analysis, 75 of the SVA RIP tagging SNPs were present (Supplementary Table 3). One of these SNPs (rs55653937) was reported as associated with PD at genome wide significance (p = 1.64×10 −19 , β = −0.2488) and was in strong LD (r 2 = 0.994) with SVA_67 located 12 kb upstream of the KANSL1    Reference SVA RIPs are associated with progression markers of PD in the PPMI cohort The PPMI cohort has extensive clinical and phenotypic data available at four time points, baseline (0 months) and subsequent measurements at 12, 24 and 36 months. This data includes motor, cognitive and olfactory testing, dopamine transporter (DAT) imaging assessments and analysis of markers in biospecimens. Analysis of the 81 SVA RIPs identified seven associated with features of PD (Table 2) at 12 months or later. In Table 2 the FDR corrected p values are reported for the linear mixed effects model and Tukey adjusted p values for the pairwise analysis addressing the genotypic effect. The nearest gene to each of these SVAs are reported in Table 3 and for chromosomal loci see Supplementary  Table 1. Four of the SVA RIPs (36, 47, 67 and 79) in Table 3 were associated with changes in the count density ratios (CDR) of the caudate/putamen, which had been calculated from DaTscan single photon emission computed tomography (SPECT) imaging. Figure 2a Table 3 were associated with the following measurements reflecting the progression of PD: Hoehn and Yahr staging and the Movement Disorder Society -Unified Parkinson's Disease Rating Scale (MDS-UPDRS). For example, the presence of SVA_11 was significantly associated with a lower mean MDS-UPDRS part II score at 36 months ( Fig. 2b) when comparing PP (7.6, 95% CI 6.3-8.9) and PA (8.9, 95% CI 8.2-9.6) genotypes to AA (10.9, 95% CI 9.6-12.2, p = 0.001 and p = 0.02). SVA_24 was significantly associated with changes in the categorical Hoehn and Yahr stage in PD subjects (Fig. 2c The genotype of specific SVA RIPs is associated with differential gene expression Transcriptomic data from whole blood was utilised to determine if the seven SVA RIPs associated with PD progression (Table 3) had any influence on gene expression. RNA sequencing data from whole blood was available at baseline (0 months) and four additional time points (6, 12, 24 and 36 months). In the analysis, all subjects in the cohort were used to compare gene expression profiles between subjects with the three different genotypes (PP, PA and AA) based on the presence or absence of the specific SVA RIP with statistical analysis performed at 0 and 36 months. For all of the seven SVA RIPs analysed, at least one genotype at one time point was associated with differential gene expression ( Table 3). The number of genes whose differential expression was significantly associated with specific SVA genotypes (FDR < 0.1) at 0 and 36 months is summarised in Table 3 and the gene with the lowest adjusted p value is shown in brackets. Genotypes of SVA_7 and SVA_36 were associated with changes in gene expression of hundreds or thousands of genes at many different loci. Whereas the effects of other SVA RIPs (24 and 67) were restricted to a smaller number of genes (Table 3), many of which were located at the SVA locus, and several of those genes were affected at both time points analysed. However, one SVA RIP demonstrated a narrower effect on gene expression with only the expression of one gene (RAB3B) associated with SVA_79 genotype. The effect of the SVA RIPs on gene expression varied between the different elements and below we highlight specific examples of genes or clusters whose expression was associated with multiple SVA genotypes and/or at both time points at which statistical analysis was performed. The expression of FLACC1 was significantly different for individuals with PP genotype compared to AA for SVA_11 (located in an intron of CASP8) at 0 (FDR p = 0.04) and 36 months (FDR p = 2.22 × 10 −5 ) (Fig. 3a). FLACC1 is adjacent to CASP8 in the genome and SVA_11 is 3.5 kb downstream of its 3'UTR. The expression of HLA-A was repeatedly the most significant gene to be associated with SVA_24 (located 8 kb upstream of HLA-A) genotypes at 0 and 36 months (Table 3). At both 0 and 36 months the subjects with PP genotypes had significantly higher HLA-A expression than those with AA genotypes (FDR p = 6.22 × 10 −51 and p = 2.49 × 10 −38 ) (Fig. 3b). Those with PA genotypes also showed significantly higher HLA-A expression than AA genotypes at 0 and 36 months (FDR p = 1.90 × 10 −78 and p = 7.51 × 10 −60 ), however, there was no significant difference when comparing PP and PA genotypes (Fig. 3b).
SVA_67 at the chromosomal locus 17q21.31 was associated with differential expression of multiple genes, including six in a 1.15 Mb region centred around the SVA RIP (Fig. 4a). At baseline when comparing PP and AA genotypes three (PLEKHM1 (FDR p = 2.38 × 10 −6 ), ARL17A (FDR p = 8.72 × 10 −5 ) and CRHR1 (FDR p = 7 × 10 −4 )) out of the four significantly associated genes were located in this region as were five (PLEKHM1 (FDR p = 2.40 × 10 −9 ), ARL17A (FDR p = 1.21 × 10 −9 ), CRHR1 (FDR p = 1.47 × 10 −8 ), MAPT (FDR p = 0.001) and LRRC37A (FDR p = 0.03)) out of seven genes whose expression was significantly different when comparing PP to PA genotypes ( Fig. 4b-f). Extending the analysis to expression data at 36 months there were 22 genes whose expression was significantly different when comparing PP and AA genotypes of SVA_67, 4 of which were located in the 1.15 Mb region (PLEKHM1 (FDR p = 0.008), ARL17A (FDR p = 0.006), CRHR1 (FDR p = 2.5 × 10 −4 ) and MAPT (FDR p = 0.002)) ( Fig. 4b-e). These same four genes as well as two others (LRCC37A (FDR p = 0.002) and KANSL1 (FDR p = 0.06)) in this genomic region were also differentially expressed when comparing PP and PA genotypes at 36 months ( Fig. 4b-g). Four of the genes in this region whose expression was associated with SVA_67 (PLEKHM1, ARL17A, CRHR1 and MAPT) all showed higher expression in individuals with PP genotypes and the individuals with the lowest expression had an AA genotype. This was in contrast with the levels of expression of LRCC37A and KANSL1 where the opposite pattern was observed.
For the seven SVAs analysed for their association with differential gene expression six had tagging SNPs available. The Genotype-Tissue Expression (GTEx) portal (https://gtexportal.org/ home/) 23 and the six SVA tagging SNPs were used to extend the findings above. Five of the tagging SNPs were located in the GTEx portal, three of which were identified as expression quantitative trail loci (eQTLs) in several tissues ( Table 3). The tagging SNP for SVA_67 was identified as an eQTL in 49/54 tissues, including 13 brain regions, and was associated with expression of over 30 genes, which included the six genes in Fig. 4.

DISCUSSION
This study incorporated genetic, clinical and gene expression data from the PPMI cohort to evaluate the role of reference SVAs in PD, identifying elements associated with disease progression that can modify the transcriptome. In this analysis, we characterised novel variation in the human genome identifying 81 reference SVA RIPs. In the PPMI cohort we did not identify any SVAs that were associated with disease risk. However, in the literature RIPs have been reported as candidate causative variants at known disease A.L. Pfaff et al.   associated loci by identifying those in strong LD with trait associated SNPs 19,24 . Utilising this approach we identified one SVA RIP in our cohort in strong LD with a SNP associated with PD risk at genome wide significance. This SVA (SVA_67) is located in a 1.8 Mb region of high LD on chromosome 17 where there are two predominant haplotypes (H1 and H2) and the presence of SVA_67 is part of H1 with the SVA absent in H2. H1 is associated with increased risk of developing PD in multiple studies [25][26][27] . This region encompasses several genes and includes MAPT in which mutations can cause the neurodegenerative diseases frontotemporal dementia with parkinsonism and progressive supranuclear palsy 28,29 . However, it remains unclear which variant or gene within the H1 haplotype is responsible for the increased risk to PD and our study has highlighted additional types of variation to be considered at this locus. Functional work would be required to further explore the influence of SVA _67 at this locus and to identify the functional risk allele in this complex region. The role of SVAs in the progression of PD has not been thoroughly addressed in the literature and here we identified seven SVAs associated with the progression of PD (Table 2). Clinical scales such as the Hoehn and Yahr and the MDS-UPDRS are used to measure disease severity and three of the SVA RIPs were associated with significant differences in these scores between the different genotypes of the insertions. The remaining four SVA RIPs were associated with changes observed in DaTscan SPECT analysis, specifically the ratio of caudate to putamen CDR ( Table 2). DaTscan SPECT utilises a radioligand that binds to the DAT that is present on the presynaptic membrane of nigrostriatal neurons. The loss of the DAT from the striatum is characteristic of PD, correlated with the loss of dopaminergic neurons from the substantia nigra and used to distinguish PD from other neurodegenerative parkinsonisms such as essential tremor 30 . The reduction of striatal DAT correlates with symptom severity and changes over the course of the disease, therefore this can be used to monitor dopaminergic degeneration 31,32 . The caudate/ putamen ratio is a measure of the anterior to posterior gradient of dopaminergic loss and increases in this gradient were associated with specific SVA RIP genotypes. The SVAs associated with these changes may not be directly causing these alterations to the gradient of dopaminergic loss but may be involved in modifications to the disease course that are measured by these features of DaTscan SPECT.
We have previously shown that SVAs can affect gene expression in in vitro and in vivo models using reporter gene assays 8,33 . More recently, studies have integrated genetic data regarding SVA RIPs and RNA sequencing to identify those elements that regulate transcription 18,19 . The SVA RIPs associated with the progression of PD had a significant impact on the transcriptome and this could be one of the mechanisms by which the SVAs are exerting an effect on PD progression. Genes at both the SVA RIP locus and distant regions of the genome, including different chromosomes, were associated with specific SVA RIP genotypes. For example, SVA_24 was most significantly associated with robust changes of expression of its nearest gene (HLA-A) and SVA_67 was associated with differential expression of a cluster of neighbouring genes. In contrast, SVA_36 was associated with changes in expression of thousands of genes at many different loci including those on different chromosomes to the SVA RIP. These differences observed could be due to the SVA insertion itself, the genomic architecture at each particular locus or a combination of both. SVAs could influence gene expression through multiple mechanism such as the recruitment of transcription factors, the interaction of SVAs with distant promoters through 3D chromatin structure and intronic insertions affecting mRNA splicing. Regulatory domains can function in a tissue-specific manner and therefore we utilised the GTEx database 23 to evaluate the potential for the SVAs to act in other tissues using their proxy SNPs. Tagging SNPs for five out of the seven SVAs were located in the GTEx database and three of Table 3. Differentially expressed genes significantly associated with the genotype of SVA RIPs at baseline (0 months) and 36 months follow up visit. The number of genes significantly differentially expressed (FDR corrected p < 0.1) are reported for each SVA RIP genotype followed by the most significant gene in brackets. these were identified as eQTLs in multiple tissues (Table 3), including different regions of the brain such as the caudate, putamen, substantia nigra and cerebellum. This suggests the SVAs (7, 36 and 67) could also be influencing gene expression in other tissues in addition to whole blood demonstrated in this study. The strength of this study was the integration of different types of data to evaluate the role of reference SVAs, which have yet to be addressed in detail, in PD risk and progression. The richness of the PPMI longitudinal data available is a clear benefit to this study and is a very valuable resource. The high quality WGS data from PPMI enabled accurate genotyping of the SVAs and DNA samples available for the same individuals was important for validation. However, the study was limited to a single family of retrotransposons and did not address those belonging to the Alu and L1 families, which would be important to evaluate in future work. Another limitation is the small number of samples for identifying SVAs involved in the risk of developing PD, however, the tagging SNP approach enabled the evaluation of these elements in much larger well-powered case-control cohorts. Using this systematic approach to focus our analysis on SVAs in PD we identified novel genetic variation in the PPMI cohort and specific SVA RIPs that were associated with dopaminergic degeneration. We were able to demonstrate these elements have a regulatory function and could be influencing the disease course through modulation of gene expression.

METHODS
Identifying SVAs polymorphic for their presence/absence in the reference genome from whole genome sequencing data from the PPMI cohort Whole genome sequencing data in bam file format aligned to hg38 were obtained from the PPMI. The use of data from PPMI was approved by the Human ethics committee at Murdoch University (2020/040). Written and informed consent was obtained by PPMI when participants were enrolled. The structural variant caller Delly2 (https://github.com/dellytools/delly) 34 , with default settings, was used to genotype structural variants in 608 individuals (179 healthy controls, 371 PD and 58 SWEDD) from the PPMI cohort. Those structural variants located within the coordinates of the human specific SVAs, available on the UCSC genome browser (https:// genome.ucsc.edu/) 35 Fig. 2 Polymorphic SVAs were associated with markers of PD progression in the PPMI cohort. a PD individuals with the AA genotype of SVA_67 had a significantly higher ipsilateral caudate/putamen CDR at both 12 and 24 months compared to those with PA and PP genotypes. b PD individuals with the AA genotype of SVA_11 had a significantly higher MDS-UPDRS part II score at 36 months compared to those with PA and PP genotypes. c PD individuals with the PA genotype of SVA_24 had a significantly higher Hoehn and Yahr score at 36 months compared to those with AA and PP genotypes. 95% confidence intervals are represented by the blue bar. *p < 0.05, **p < 0.01 and ***p < 0.001 (Tukey adjusted p values are reported).     Table 4) were analysed in the PD subjects from four visits at baseline (0), 12, 24 and 36 months for the majority of the data points (data relating to the DaTscan SPECT imaging was not available at 36 months). The p values from the linear mixed-effects model were corrected for the number of features analysed and the FDR corrected p-value is reported in Table 2. A pairwise analysis was then performed using estimated marginal means on the SVAs that were significant after FDR correction for an effect size comparison with Tukey adjusted p-value for family-wise error rate reported in Table 2. The number of PD subjects analysed at each time point for each genotype of the seven significant SVAs in the linear mixedeffects model and pairwise comparison are reported in Supplementary  Table 5.
Analysis of SVA RIP genotype and gene expression in blood RNA-sequencing data from the PPMI cohort

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.