Intronic ATTTC repeat expansions in STARD7 in familial adult myoclonic epilepsy linked to chromosome 2

Familial Adult Myoclonic Epilepsy (FAME) is characterised by cortical myoclonic tremor, usually from the second decade of life, and overt myoclonic or generalised tonic-clonic seizures. Four independent loci have been implicated in FAME on chromosomes (chr) 2, 3, 5 and 8. Using whole genome sequencing and repeat-primed PCR, we provide evidence that chr2-linked FAME (FAME2) is caused by an expansion of an ATTTC pentamer within the first intron of STARD7. The ATTTC expansions segregate in 158/158 individuals typically affected by FAME from 22 pedigrees including 16 previously reported families recruited worldwide. RNA sequencing from patient-derived fibroblasts shows no accumulation of the AUUUU or AUUUC repeat sequences and STARD7 gene expression is not affected. These data, in combination with other genes bearing similar mutations that have been implicated in FAME, suggest ATTTC expansions may cause this disorder, irrespective of the genomic locus involved.

FAME (also referred to as Familial Cortical Myoclonic Tremor and Epilepsy or Benign Adult onset Familial Myoclonic Epilepsy [OMIM phenotypic series: PS601068]) is characterised by cortical myoclonic tremor and overt myoclonic and later generalised tonic-clonic seizures (GTCS) 1 . Onset of symptoms occurs in the second to third decade with variable expressivity within and between families; anticipation has been noted in some families 1 . The frequency of GTCS varies from 15 to 100% in 22 different families reported here (Table 1) 2 . Seizures are typically controlled with anti-epileptic drugs for generalised epilepsies, although rarely individuals have drug resistant epilepsy. FAME has been mapped to four distinct chromosomal loci. Most families link to chromosomes 8q24 3 or 2p11.2-q11.2 4 , with an additional two families mapping to chromosome 5p15.31-p15 5 and one to chromosome 3q26.32-q28 6 . There is one report of autosomal recessive FAME caused by mutation in CNTN2 where the phenotype was disputed 7,8 . Candidate genes and variants that fall within these common linkage intervals have been suggested for chr2 (ADRA2B) and chr5 (CTNND2); however, none of these genes have been shown to be allelic in all FAME families with linkage to the same interval 1 . We previously showed using identity-by-descent mapping that there are at least four distinct founder loci linked to FAME2 (OMIM:607876) on chr2 9 .
The genetic cause of FAME has long remained elusive. The cause of FAME1, which is linked to chr8 (OMIM:601068), has recently been shown to be a complex repeat expansion of pentameric TTTTA and inserted TTTCA repeats into the fourth intron of the SAMD12 gene 10,11 . In the same study, TNRC6A (chr16) and RAPGEF2 (chr4) were implicated as FAME genes within single families, respectively, found via direct detection of the same repeated TTTTA and TTTCA sequences 11 .
Here, we use bioinformatic analysis of short-read whole-genome sequencing to identify ATTTT and ATTTC repeat expansions in the FAME2 linkage interval. We screen for an intronic ATTTC expansion in the first intron of STARD7 by repeat-primed PCR and show it segregates with FAME2 in 158 affected individuals from 22 families. We use long-read sequencing to suggest the ATTTT and ATTTC expansions may be somatically unstable. We analyse clinical data and show evidence of anticipation over multiple generations of a large FAME2 family. Finally, we demonstrate that the presence of the ATTTC repeat has no effect on protein or mRNA expression levels of STARD7 in available patient cell lines. These data suggest the repeat sequence alone is pathogenic, independent of an effect on the coding sequence of the encompassing gene.

Results
Discovery of a repeat expansion in STARD7. We analysed Illumina HiSeq X-10 whole-genome sequencing data initially from two individuals from a large Australian-New Zealand FAME family, one from an Italian family and three from a French-Spanish family (Table 1 and Supplementary Table 1; Families 1, 3 and 19, respectively) 2,12,13 with two repeat expansion detection methods, ExpansionHunter and exSTRa 14,15 , to look for similar combined ATTTT and ATTTC repeat expansions on both the forward and reverse chromosome strands within the FAME2 interval. This revealed an expansion of an ATTTT repeat and insertion of an ATTTC repeat in the context of the reverse strand of chr2 within the first intron of STARD7 (StAR-related lipid transfer domain-containing 7) in all FAME samples tested (Fig. 1a, Supplementary Fig. 1). The endogenous ATTTT repeat in intron 1 of STARD7 was also found to be variable in length in the normal population but not expanded to the same extent as repeats found in individuals with FAME. The ATTTC repeat was not present in any whole-genome sequencing data from 69 control samples (Supplementary Fig. 1), nor is it reported in the Simple Repeats track in the UCSC genome browser (build hg38) 16 .
Table footnotes: CT cortical tremor, Focal Sz focal seizures, TCS tonic-clonic seizures, y years, n number of individuals. a One family member last evaluated at 9 years of age.
Segregation of STARD7 ATTTC expansions by repeat-primed PCR. We developed a repeat-primed PCR (RP-PCR) assay to rapidly identify the expansion in 137/137 affected individuals from 16 independently reported FAME2 families worldwide (Fig. 1c, subsequent analyses with microsatellite markers showed that these individuals did not have the same haplotype as affected carriers of the ATTTC expansion (Supplementary Fig. 3). The ATTTC repeat expansion did not amplify in any of 28 control DNA samples extracted from unaffected individuals unrelated to FAME.
In all 158 individuals that tested positive for the ATTTC expansion, we observed that priming from ATTTT repeats was only successful from the telomeric end of the endogenous repeat and priming from ATTTC repeats was only possible from the centromeric end of the endogenous repeat. This suggested the structure of the pathogenic repeat in the context of the forward strand of chr2 was (AAATG)n[N](AAAAT)n, where (n) represents the unknown number of each repeat sequence.
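The relationship between the gene-strand motifs (ATTTT, ATTTC) and the forward-strand notation (AAAAT, AAATG) can be checked with a short sketch. The helper names below are illustrative and are not part of the study's analysis; the code only shows that the reverse complement of ATTTC is a cyclic rotation of AAATG, so both describe the same tandem repeat.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def same_repeat_unit(motif_a: str, motif_b: str) -> bool:
    """True if the two motifs are cyclic rotations of one another,
    i.e. the same tandem repeat read from a different starting phase."""
    return len(motif_a) == len(motif_b) and motif_b in (motif_a + motif_a)


print(reverse_complement("ATTTT"))                             # AAAAT
print(reverse_complement("ATTTC"))                             # GAAAT
print(same_repeat_unit(reverse_complement("ATTTC"), "AAATG"))  # True
```

Because a tandem repeat has no fixed starting phase, GAAAT and AAATG name the same repeat tract; which rotation is reported is a matter of convention.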
Long-read sequencing reveals the repeat structure. The total numbers of repeats could not be determined by the RP-PCR assay; we therefore investigated some of these with long-read sequencing (Fig. 2). In one individual from the Australian-New Zealand family (Family 1: IV-98), a single molecule real-time (SMRT) read and a single Oxford Nanopore read were found that spanned the repeat. The SMRT read, generated to 99% base accuracy by circular consensus calling, comprised four subreads and contained 274 AAATG and 387 AAAAT repeats, without interruption from other sequences. The Oxford Nanopore read contained 345 AAATG and 390 AAAAT repeats with some interruptions, suggesting somatic variation of repeat sizes may occur within the one individual. In a second individual (Family 5; III-37), a single Oxford Nanopore read spanned the expanded repeats with 588 AAATG and 340 AAAAT repeats; 4645 bp in total length. The natural variability in the length of the endogenous ATTTT repeat sequence meant that it was not feasible to use that sequence for mutation screening; however, the ATTTC repeat primer was diagnostic for FAME with a sensitivity of 100% in all families with linkage or suggestive linkage to chr2. This included two families with the previously identified ADRA2B; c.675_686delTGGTGGGGCTTTinsGTTTGGCAG; p.H225_L229delinsQ225_F_G_R228 variant, strongly suggesting that allele is not causative.
Fig. 1 Identification of an expanded pentameric ATTTC repeat causing FAME2. a Estimated sizes of the AAATG repeats in two affected individuals from Family 1 (red, orange), one from Family 3 (brown) and three affected individuals from Family 19 (blue, green, purple), compared to 69 individuals without FAME using TruSeq Nano (grey) or KAPA Hyper (tan) library preparation. Left panel shows empirical cumulative distribution functions from exSTRa while the right panel shows the estimated repeat size by ExpansionHunter (the sum of both alleles suggests repeat sizes of 0.75-2.3 kb). Data underlying this part of the figure are available in Source Data. b WGS data from two individuals in Family 1 and one from Family 3 show reads suggesting expansion of AAAAT and insertion of AAATG repeats in the chr2 linkage interval. c Upper section shows the location of the repeat in the context of chr2. The approximate location of the FAME2 minimal linkage interval is shown above the ideogram with two blue arrow heads. The STARD7 gene is on the reverse chromosome strand and the endogenous AAAAT repeat is found in the first intron of the gene. Schema in the lower section shows the primers used in the RP-PCR to detect the ATTTT "3′ assay" and ATTTC "5′ assay" expanded repeats, respectively. d Example results of the RP-PCR 5′ assay obtained in an individual negative for the ATTTC insert (top panel) and in an individual affected by FAME, positive for the ATTTC repeat insertion (bottom panel). Full screening results are provided in Supplementary Data 1. e Summary of 184 individuals from 22 families tested with the RP-PCR assay. Individuals under category (+) tested positive for the ATTTC repeat and individuals under category (−) tested negative for the repeat.
Evidence of anticipation in a large FAME2 family. In view of the discovery that FAME2 and FAME1 are caused by similar dynamic mutations of ATTTC repeats, and the demonstration of clinical anticipation in FAME1 11 , we searched for evidence of anticipation in our pedigrees. We examined the median onset age of any relevant symptom, where available, for each generation in the Australian/New Zealand family (Family 1). We found evidence of anticipation: generation III had a median onset of 30 years (range 14-60 y, n = 6), generation IV a median onset of 17 years (8-50 y, n = 30) and generation V a median onset of 12 years (4-19 y, n = 16). The remaining families were either too small or onset data were unavailable for anticipation to be robustly assessed.
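As an illustration of how repeat copy numbers such as those reported for the long reads might be tallied, the sketch below counts the longest uninterrupted tandem run of a motif in a sequence. It is a minimal example on a synthetic read, not the pipeline used in the study; the function name and flanking bases are assumptions, and real nanopore reads with interruptions or base-calling errors would need an error-tolerant approach.

```python
import re


def longest_tandem_run(seq: str, motif: str) -> int:
    """Number of motif copies in the longest uninterrupted tandem run of `motif`."""
    runs = re.findall(f"(?:{re.escape(motif)})+", seq)
    return max((len(run) // len(motif) for run in runs), default=0)


# Synthetic read (NOT patient data): arbitrary flanks around the uninterrupted
# structure reported for the SMRT read, (AAATG)274 followed by (AAAAT)387.
read = "CCGG" + "AAATG" * 274 + "AAAAT" * 387 + "GGCC"

n_aaatg = longest_tandem_run(read, "AAATG")
n_aaaat = longest_tandem_run(read, "AAAAT")
print(n_aaatg, n_aaaat)         # 274 387
print((n_aaatg + n_aaaat) * 5)  # 3305 bp spanned by the two repeat tracts
```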
STARD7 transcript and protein abundance are not altered. Reverse transcriptase quantitative PCR, using one primer pair spanning the repeat-containing intron between exons one and two and a second pair spanning exons three and four, showed no significant differences in STARD7 transcript expression in patient-derived fibroblast cell lines (Fig. 3a). Protein abundance was also unaltered, confirmed by western blotting using an antibody to STARD7 protein that was previously validated using STARD7-knockout cell lines (Fig. 3b) 18 . RNA-Seq data from six patient-derived fibroblasts (four from Family 1 and two from Family 5) showed there was no significant difference in gene expression of STARD7 between affected and unaffected individuals along the entire length of the gene (Supplementary Fig. 4; p = 0.838; False Discovery Rate = 1). Reads containing ATTTC repeats were not present in the RNA-Seq data despite robust expression of STARD7. This is consistent with the observations from lymphoblastoid cell lines (LCLs) derived from individuals with FAME1, where no reads with repeats were found 11 .

Discussion
The pathogenic ATTTC insertion and expansion was always accompanied by the endogenous ATTTT pentanucleotide repeat in all cases of FAME2 that we describe here, replicating the findings in the cases of FAME with expansions in SAMD12, TNRC6A, RAPGEF2 10,11,19 and the report of a similar expansion in MARCH6 causing chr5-linked FAME 20 . The same observation also holds for spinocerebellar ataxia 37 (SCA37, OMIM: 615945), which is caused by the same repeat expansion in the first intron of DAB1 21 . For SCA37, it has been hypothesised that the thymidine to cytosine transition occurs after expansion of the endogenous ATTTT repeat to ~200 copies followed by further expansion of the mutant ATTTC sequence 22 . The ATTTT/ATTTC strand of the repeat is aligned with the direction of gene expression in all genes reported thus far, regardless of their chromosomal orientation. The mechanism of disease pathogenesis has been suggested to be RNA toxicity 21 . In zebrafish embryos, direct injection of RNA containing 58 copies of the AUUUC repeat was lethal or caused developmental defects in 81%, while the effect of injecting RNA containing 139 AUUUU repeats was not significantly different from controls 21 . Accumulation of AUUUC repeat containing RNA was observed in the brain of some individuals with FAME1, but we did not have access to similar biopsy tissue from individuals with FAME2 11 . While we found no significant change in expression of STARD7 in patient-derived cell lines, it is possible that expression of this gene is regulated differently in the non-proliferating cells of the brain. Profiling expression of all known genes implicated with pathogenic ATTTC dynamic mutations using gene expression data from the GTEX portal https://www.gtexportal.org 23 shows that DAB1 has high expression specifically in cerebellum while the five genes implicated in FAME thus far are more broadly expressed throughout the brain (Fig. 4). This difference in expression may partly explain the absence of epilepsy in individuals with SCA37.
STARD7 is a member of the START (StAR-related lipid transfer) domain-containing family of lipid transfer proteins with functions including intra-mitochondrial lipid transfer of phosphatidylcholine 24 . Previously, increased levels of choline have been detected by proton magnetic resonance spectroscopy (1H-MRS) in the cerebellum of 11 individuals from three Italian families all shown here to have the ATTTC dynamic mutation 25 (Table 1). This observation may be peculiar to FAME2 families since the SAMD12, RAPGEF2, TNRC6A and MARCH6 genes do not have overlapping molecular functions.
In conclusion, we have identified the molecular basis of FAME2 as an inserted, expanded ATTTC repeat in the first intron of the STARD7 gene, in 22 pedigrees with 266 affected individuals. The insertion segregates with disease status in 100% of individuals tested from families with linkage or suggestive linkage to chromosome 2, providing substantial genetic evidence that this mutation is causal in this syndrome. The FAME2 locus is the most frequently observed linked region for Caucasian individuals affected by this disorder, whereas chromosome 8 linkage is thus far limited to Asian individuals; molecular genetic testing should therefore take this into consideration if choosing to screen by RP-PCR. Identification of the gene and causative mutation for FAME2 opens the opportunity to explore the origins of the ATTTT/ATTTC expansion through a detailed comparison of the haplotypes and repeat structures of these individuals, as has been done for SCA37 22 . There may be many additional undiagnosed individuals with a spectrum of FAME-related symptoms whose genetic causes may be due to ATTTC insertion and expansion at one of the FAME loci. This is especially likely in families that have multiple individuals with tremor and a low frequency of GTCS. As no preventive or curative treatments are currently available for FAME, these findings may have important therapeutic implications, including RNA-targeting treatments, such as antisense oligonucleotides or RNA-targeting Cas9 (RCas9) 26 .

Methods
Ethics. This study was approved by the Human Research Ethics Committees of the University of Melbourne and the University of Adelaide. Written, informed consent was obtained from all participants in the study.
Whole-genome sequencing. Adelaide: Human genomic DNA extracted from peripheral blood lymphocytes was prepared from two individuals in Family 1 (IV-3 and V-118) for sequencing using the TruSeq Nano DNA Library Preparation Kit (Illumina). Mapping of 150 bp, paired-end sequence reads to the UCSC hg19 build of the genome and calling of single nucleotide variants from whole-genome sequencing (WGS) data generated using an Illumina HiSeqX10 platform (Kinghorn Centre for Clinical Genomics, Sydney, Australia) were performed as previously described, with the minor modification of using the Genome Analysis Toolkit (GATK) version 3.8 software 27,28 . Filtering of both coding and non-coding variants within the chr2 linkage interval shared between both individuals under a dominant model and absent from the gnomAD variant database 29 at a frequency >0.001 was performed using the bcftools isec command from htslib v1.9. Single nucleotide variants and indels were annotated with ANNOVAR 30 . Reads Resulting BAM files underwent local realignment around insertion-deletion sites, duplicate marking and recalibration steps with GATK v3.8 28 . Variant calling was performed with HaplotypeCaller v3.8 with standard parameters, and output VCF files were recalibrated with VariantRecalibrator from GATK v3.8. Genomic variant annotation was carried out with VarSeq v1.4.7 (Golden Helix, Inc., Bozeman, MT, www.goldenhelix.com) and only variants with a minimum read depth of 5X were included in the downstream analysis. Thereafter, only variants in the pericentromeric region of interest of chr2 (chr2: 91,800,000-106,700,000) were considered.
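The depth and region restrictions described above amount to a simple per-variant predicate. The sketch below illustrates that logic only; the `Variant` record layout and the example coordinates are hypothetical and do not correspond to the VarSeq workflow or to any variant in the study. The thresholds (minimum depth 5X, chr2: 91,800,000-106,700,000) are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int    # 1-based genomic coordinate
    depth: int  # read depth supporting the call


# Thresholds from the text; everything else here is an illustrative assumption.
REGION_CHROM, REGION_START, REGION_END = "chr2", 91_800_000, 106_700_000
MIN_DEPTH = 5


def in_region_with_depth(v: Variant) -> bool:
    """Keep variants inside the chr2 region of interest with depth >= 5X."""
    return (
        v.chrom == REGION_CHROM
        and REGION_START <= v.pos <= REGION_END
        and v.depth >= MIN_DEPTH
    )


# Made-up records used only to exercise the filter.
variants = [
    Variant("chr2", 96_000_000, 30),   # inside interval, adequate depth
    Variant("chr2", 96_000_100, 3),    # inside interval, depth too low
    Variant("chr2", 120_000_000, 40),  # outside interval
    Variant("chr8", 96_000_000, 40),   # wrong chromosome
]
kept = [v for v in variants if in_region_with_depth(v)]
print(len(kept))  # 1
```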
Prioritisation of variants of potential interest was carried out through three distinct analyses. For the first analysis, all variants reported to be pathogenic or potentially pathogenic in the clinical databases of ClinVar, HGMD Professional v2017.2 and/or Centogene CentoMD v4.1 were retained. For the second analysis, we focused on variants in exonic regions without a reported clinical annotation. We excluded variants with a population frequency above 1% in the databases of 1000 Genomes Project, National Heart, Lung and Blood Institute (NHLBI, https://www.nhlbi.nih.gov/) Exome Sequencing Project (ESP, http://evs.gs.washington.edu/), ExAC (Exome Aggregation Consortium, http://exac.broadinstitute.org/) and gnomAD (The Genome Aggregation Database, https://gnomad.broadinstitute.org/), along with variants recorded in the Personal Genomics internal database. We retained all the non-synonymous variants predicted to alter the protein structure or function by at least three of the following in silico prediction tools: Mutation Taster, SIFT, Polyphen-2, MutationAssessor and FATHMM. For the third analysis, we prioritised the variants outside exonic regions by considering rare variants (frequency below 1% in population frequency databases, including the Personal Genomics internal database) with a predicted significant effect on the protein structure or function by at least three of the in silico prediction tools. Variants were then prioritised by considering their presence in regulatory regions as reported in the ENCODE database (https://www.encodeproject.org/). Manual inspection of the BAM files using the Integrative Genomics Viewer (IGV) allowed us to evaluate the coverage of the variants and the quality of the aligned reads.
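The "rare and predicted damaging by at least three of five tools" rule can be written as a short predicate. This is a sketch of the stated criteria under assumed inputs: the function name, the dictionary of boolean per-tool calls and the example values are hypothetical, and the way each tool's score is converted to a damaging/benign call is not specified in the text.

```python
TOOLS = ("MutationTaster", "SIFT", "PolyPhen-2", "MutationAssessor", "FATHMM")


def is_prioritised(max_population_freq: float,
                   damaging_calls: dict,
                   min_tools: int = 3,
                   max_freq: float = 0.01) -> bool:
    """Keep a variant whose highest population frequency does not exceed 1% and
    that at least `min_tools` of the five tools call damaging."""
    n_damaging = sum(1 for tool in TOOLS if damaging_calls.get(tool, False))
    return max_population_freq <= max_freq and n_damaging >= min_tools


# Hypothetical variants used only to exercise the rule.
rare_damaging = {"MutationTaster": True, "SIFT": True, "PolyPhen-2": True}
rare_benign = {"MutationTaster": True, "SIFT": False, "FATHMM": True}

print(is_prioritised(0.0002, rare_damaging))  # True
print(is_prioritised(0.0002, rare_benign))    # False (only two tools)
print(is_prioritised(0.05, rare_damaging))    # False (too common)
```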
The identification of putative genomic expansions, structural variants or copy number variations was carried out by using Lumpy v0.2.13 33 and Manta v1.2.2 34 software. The ExpansionHunter tool v2.5.3 14 was adopted to estimate the size of potential repetitions of short unit sequences.
Long-read sequencing. DNA was extracted for all long-read sequencing protocols using the QIAsymphony system from skin fibroblasts (passage 6) cultured in Dulbecco's modified Eagle's Medium (DMEM; Life Technologies) with 10% fetal calf serum. Pacific Biosciences (PacBio) single molecule real-time (SMRT) sequencing data were obtained in two batches: In the first batch, two Australian FAME2 carriers (Family 1: IV-44 and IV-98) were sequenced with two flow cells per sample. Resulting bam files were converted to fastq using the SMRT Link software v5.1.0 bam2fastq program. Resulting fastq files were either mapped directly to the human genome hg38 build using NGM-LR 35 with structural variants called by Sniffles 35 or used as input for de novo assembly with Canu v1.7. In the second batch, a single sample (Family 1: IV-98) was sequenced. DNA fragment sizes were determined with the Femto Pulse capillary electrophoresis system (Agilent Technologies, Santa Clara, CA). DNA fragments of size greater than 6 kb were selected with BluePippin (Sage Science, Beverly, MA) pulsed field gel electrophoresis system. Sequencing was carried out for 20 h per SMRT cell on the Sequel system with Binding Kit 3.0 (PacBio, 101-500-400) and Sequencing Kit 3.0 (PacBio 101-427-800). Circular consensus calling was performed using CCS 3.2.1 software. Reads were mapped to the GRCh38 build of the human genome using pbmm2 with "-c 0 -L 0.01" for CCS reads and "-c 0 -L 0.1" for subreads.
Oxford Nanopore data were obtained for DNA samples extracted from fibroblasts from two individuals from Family 1, as described above, and two from Family 5 (II-37 and IV-29; Fig. S2e). For each of the four participant samples, 3 µg of DNA was prepared for Oxford Nanopore 1D genomic sequencing by ligation using the SQK-LSK108 kit and was run on a FLO-MIN106 flow cell for 48 h. Basecalling was performed on MinKNOW 18.01.6 with MinKNOW Core 1.11.5 and Albacore v2.1. Data were either mapped with NGM-LR or assembled with Canu v1.7 as described below, using suggested settings for nanopore sequencing reads.
De novo whole-genome assembly was carried out for one individual from Family 1, using both PacBio and nanopore sequencing reads as input, with the Canu v1.7 assembler and default starting parameters for a genome size of 3.6 Gbp. Recalibrated reads from Canu were mapped to the hg38 build of the human genome using NGM-LR as described above.
Repeat expansion analysis. WGS was performed for two affected individuals from Family 1 on the Illumina HiSeq X10 platform, one individual from Family 3 as described above, and three affected individuals from Family 19 on the Illumina HiSeq platform. A cohort of 69 individuals without FAME was used for comparison, with 150 bp paired-end sequencing performed on the Illumina HiSeq X platform (Kinghorn Centre for Clinical Genomics, Sydney, Australia). Library preparation for 53 of the samples used the Illumina TruSeq Nano DNA HT Library Preparation Kit; the other 16 samples used KAPA Hyper Prep Kit PCR-free library preparation.
RNA-Seq. Total RNA was extracted from patient-derived primary skin fibroblasts of four Australian/New Zealand individuals with FAME, two Italian individuals with FAME and four age-matched controls using QIAGEN RNeasy kits, as per the manufacturer's protocol. Library preparation and RNA-Seq were performed as a service by the UCLA Neuroscience Genomics Core Facility. The TruSeq v2 kit (Illumina) was used to generate unstranded libraries with 150-bp mean fragment sizes, and 50-bp paired-end sequencing was performed using the HiSeq2500 (Illumina). Sequence data were mapped to the GRCh38 build of the human genome using HISAT2 v2.1.0 36 . Read counts were generated with StringTie v1.3.3 36 . Differential expression between FAME and control samples was determined using the exact test from the edgeR v3.26.5 package in R v3.6.0 37 . Differentially expressed genes were filtered to false discovery rate (FDR) < 0.05 and log2 fold change (LFC) ≥ 1 or ≤ −1.
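The differential-expression filter described above combines an FDR cut-off with a two-sided fold-change cut-off. The following sketch applies those two thresholds to per-gene results; the function name and the example values are hypothetical, and the study itself used edgeR in R rather than this code.

```python
def passes_de_filter(fdr: float, lfc: float,
                     fdr_cutoff: float = 0.05, lfc_cutoff: float = 1.0) -> bool:
    """True if a gene meets both criteria: FDR < 0.05 and |log2 fold change| >= 1."""
    return fdr < fdr_cutoff and abs(lfc) >= lfc_cutoff


# Hypothetical (FDR, LFC) pairs used only to exercise the filter.
print(passes_de_filter(0.01, 1.5))   # True: significant and up-regulated
print(passes_de_filter(0.01, -1.0))  # True: significant and down-regulated
print(passes_de_filter(0.01, 0.4))   # False: fold change too small
print(passes_de_filter(1.0, 2.0))    # False: an FDR of 1 can never pass
```

A gene with FDR = 1, as reported for STARD7 in the Results, fails this filter whatever its fold change.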
Quantitative PCR. RNA was extracted from four patient-derived primary skin fibroblast cell lines from Family 1 and four control fibroblast cell lines from adult donors not affected by FAME as described above under RNA-Seq. cDNA were generated from 1 μg of total RNA using the iScript reverse transcription kit (Bio-Rad, Gladesville, NSW, Australia; cat# 1708891), according to the manufacturer's protocol.
Quantification of differentially expressed transcripts was performed with the relative standard curve method using SYBR green fluorescence intensity for detection. Products were amplified in 1× iTaq Universal SYBR Green supermix (Bio-Rad; cat# 1725121) with primers at 1 μM final concentration. Each sample and standard was amplified with three technical replicates on an Applied Biosystems StepOnePlus. Expression values were determined relative to a dilution curve of a cDNA standard made from pooled control fibroblast cDNA. Specificity of products was determined by melt curve analysis at the conclusion of each run. Expression values of each gene were normalised to HPRT1 expression values from the same sample.
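The relative standard curve method can be sketched as follows: fit Ct against log10 of the relative quantity of the pooled standard, interpolate each sample's quantity from its Ct, then divide the target quantity by the reference-gene (here HPRT1) quantity. This is a minimal illustration under assumed inputs; all function names, dilution factors and Ct values are made up, and the instrument software used in the study may differ in detail.

```python
import math


def fit_standard_curve(relative_quantities, ct_values):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in relative_quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def quantity_from_ct(ct, slope, intercept):
    """Interpolate a relative quantity from a Ct value using the fitted curve."""
    return 10 ** ((ct - intercept) / slope)


def normalised_expression(target_cts, reference_cts, target_curve, reference_curve):
    """Mean target quantity over technical replicates divided by the mean
    reference-gene quantity from the same sample."""
    target_q = sum(quantity_from_ct(ct, *target_curve) for ct in target_cts) / len(target_cts)
    reference_q = sum(quantity_from_ct(ct, *reference_curve) for ct in reference_cts) / len(reference_cts)
    return target_q / reference_q


# Hypothetical ten-fold dilution series of the pooled standard.
dilutions = [1.0, 0.1, 0.01]
target_curve = fit_standard_curve(dilutions, [20.0, 23.0, 26.0])
reference_curve = fit_standard_curve(dilutions, [22.0, 25.0, 28.0])

# Hypothetical technical triplicates for one sample.
ratio = normalised_expression([23.0, 23.0, 23.0], [25.0, 25.0, 25.0],
                              target_curve, reference_curve)
print(round(ratio, 3))
```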
PCR amplification and sequencing of repeats (Rome). Pentanucleotide repeats were analysed in duplicate by long-range PCR with the Expand Long Template PCR System (Roche) according to the manufacturer's recommendation. Some 200 ng of genomic DNA was amplified in a 50 µl reaction containing primers STARD7F and STARD7R (300 nM), dNTPs (350 µM), buffer 1 (1×) and 0.5 U of enzyme. After 2 min of initial denaturation at 94°C, DNA samples underwent 10 cycles of amplification (denaturation at 94°C for 10 s, annealing at 56°C for 30 s, elongation at 68°C for 3 min) followed by an additional 20 cycles (denaturation at 94°C for 15 s, annealing at 56°C for 30 s, elongation at 68°C for 45 s, extended by 20 s for each successive cycle). PCR products were separated by electrophoresis on a 1% agarose gel. DNA was extracted from the agarose gel slice and the number of repeat units was determined by Sanger sequencing (Eurofins Genomics Sequencing Service).
Repeat-primed PCR. Primers for both Adelaide and Rome are shown in Supplementary Table 3.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
Source data for Figs. 1a, 3a, 3b, Supplementary Figs. 1a, b and 4b are provided in the Source Data files of this manuscript. RNA-Seq data are available from the NCBI BioProject PRJNA563467. Whole-genome sequencing data are available from the corresponding author on request, subject to human research ethics approval and patient consent.