Huntington’s disease age at motor onset is modified by the tandem hexamer repeat in TCERG1

Huntington’s disease is caused by an expanded CAG tract in HTT. The length of the CAG tract accounts for over half the variance in age at onset of disease, and is influenced by other genetic factors, mostly implicating the DNA maintenance machinery. We examined a single nucleotide variant, rs79727797, on chromosome 5 in the TCERG1 gene, previously reported to be associated with Huntington’s disease and a quasi-tandem repeat (QTR) hexamer in exon 4 of TCERG1 with a central pure repeat. We developed a method for calling perfect and imperfect repeats from exome-sequencing data, and tested association between the QTR in TCERG1 and residual age at motor onset (after correcting for the effects of CAG length in the HTT gene) in 610 individuals with Huntington’s disease via regression analysis. We found a significant association between age at onset and the sum of the repeat lengths from both alleles of the QTR (p = 2.1 × 10−9), with each added repeat hexamer reducing age at onset by one year (95% confidence interval [0.7, 1.4]). This association explained that previously observed with rs79727797. The association with age at onset in the genome-wide association study is due to a QTR hexamer in TCERG1, translated to a glutamine/alanine tract in the protein. We could not distinguish whether this was due to cis-effects of the hexamer repeat on gene expression or of the encoded glutamine/alanine tract in the protein. These results motivate further study of the mechanisms by which TCERG1 modifies onset of HD.

the variance in age at onset [3][4][5][6]. Genome-wide association studies (GWAS) have shown that other genetic variants also influence age at onset of HD, including variants in genes in DNA damage repair pathways and sequence variants in the CAG tract [7][8][9]. The most recent genetic modifier GWAS in HD (GeM-GWAS) [8] revealed 21 independent signals at 14 loci. We observed that one of the significant loci on chromosome 5 (5BM1) contained TCERG1, the only putative genetic modifier of HD onset in the GWAS to have been previously reported [10,11]. The 5BM1 locus (146 Mbp; hg19) has one significant single nucleotide variant (SNV), rs79727797 (p = 3.8 × 10^-10), with each minor allele conferring 2.3 years later onset of HD than expected from the subjects' CAG repeat length. SNV rs79727797 is within the TCERG1 gene and very close to the tandem repeat locus (Fig. 1A) previously implicated in modifying HD age at onset [10,11]. TCERG1 (Transcriptional Elongation Regulator 1; previously known as CA150) protein couples transcriptional elongation and splicing, regulating the expression of many genes [12,13]. It is highly conserved across human and mouse (97.8% identity between proteins). In humans, TCERG1 is extremely intolerant to loss of function variants (observed/expected variants = 0.13, 90% CI 0.07-0.23) and is in the 5% of genes most intolerant of amino acid missense substitutions (observed/expected variants = 0.61, 90% CI 0.56-0.67) [14]. TCERG1 binds to HTT and its expression can rescue mutant HTT neurotoxicity in rat and mouse model systems [15]. TCERG1 contains a repeat tract of 38 tandem hexanucleotides: a central perfect short tandem repeat (STR) of (CAGGCC)6 embedded in a larger imperfect hexanucleotide 'quasi' tandem repeat (QTR; Fig. 1A,B; chr5:145,838,546-145,838,773 on hg19). The whole tract is translated in TCERG1 protein as an imperfect 38 glutamine/alanine (QA) repeat interrupted with occasional valines (V; Additional file 1: Fig. S1).
Previously, a study of 432 American HD patients showed a nominally significant association of earlier onset with longer QTR length in TCERG1 (p = 0.032, not corrected for multiple testing) [10]. A study of 427 individuals from Venezuelan HD kindreds [11] testing 12 polymorphisms previously associated with HD gave a p-value of 0.07 (not corrected for multiple testing) comparing the 306bp allele (corresponding to the reference 38-repeat QTR) with all other alleles for association with age at onset. Neither study tested the effects of repeat length directly, instead inferring it from the length of the amplified PCR products, including the flanking primer sequences.
We directly determined the repeat tract sequence in TCERG1 in 610 HD patients by using short-read exome-sequencing data [1]. We then assessed the association of repeat alleles with age at onset of HD. We used a subset of 468 individuals for whom SNV data were available to test whether the rs79727797 variant was tagging the tandem repeat in TCERG1 and whether the tandem repeat was likely to be the functional variant involved in modifying HD age at onset.

Alleles observed at the TCERG1 hexamer repeat
Subjects came from the REGISTRY [16] and Predict-HD [17] studies. REGISTRY participants were the individuals with the largest difference between their observed age at motor onset and that expected given their CAG repeat length; Predict-HD participants were those with the most extreme phenotype given their CAG repeat length, as in McAllister et al. [1].
The 38-unit QTR locus is in exon 4 of TCERG1 and SNV rs79727797 just 3' to exon 19, separated by 50 kbp (Fig. 1A). The length of the QTR is polymorphic and we identified eight different alleles, mostly varying by central STR length (Fig. 1C). The reference allele (A1), with a central (CAGGCC)6 STR, is by far the most common allele, representing 91.3% of all alleles sequenced in our study (Table 1). Alternative alleles with central STRs of different lengths were observed (Fig. 1C), of which the most common was (CAGGCC)3 (4.1% of alleles; A2, Table 1). This three-repeat allele is in linkage disequilibrium with the minor allele of rs79727797: in our cohort, correlation between the SNV and allele A2 is 99%.

Association with age at onset of HD
The distribution of genotypes observed in our study is given in Fig. 2. We tested for association between residual age at onset of HD and the QTR length. As there are two alleles, we examined the association with residual age at onset of the larger or smaller repeat length, the sum of repeat lengths, and the difference between repeat lengths in each patient. We consistently found higher levels of significance in the association between residual age at onset and the sum of the repeat lengths than in the associations with the difference between repeat lengths, or maximum or minimum repeat lengths in each individual (Additional file 1: Table S1). The association of the sum of the QTR lengths from both alleles with residual age at onset was genome-wide significant (p = 5.0 × 10^-9 and 2.0 × 10^-8 without and with multiple testing correction, respectively) (Additional file 1: Table S1). Logistic regression analyses using the extremes of the residual age at onset showed a similar pattern. The relationship between the sum of the hexamer repeats and the residual age at onset in HD is illustrated in Fig. 3 (see also Additional file 1: Fig. S2 for equivalent analyses of STR). Panels A-C show that subjects with extreme late onset have more copies of the shorter alleles than those with extreme early onset, and this difference becomes more pronounced as the extremes become greater. The negative correlation between the sum of QTR lengths in an individual and residual age at onset of HD is shown in Fig. 3D, with one year earlier HD onset for each added repeat hexamer (black dashed line in Fig. 3D, 95% confidence interval [0.7, 1.4]). We estimated the QTR effect size using the regression with selection analysis described in Additional file 1: Supplementary Methods. Since our HD cohort mainly contains age at onset extremes, the linear regression analysis (grey dashed line in Fig. 3D) overestimates the QTR effect size, giving 2.75 years earlier onset for each added hexamer. However, it can be used for comparison of the association significance between different models because it provides approximately the same p-value as the regression with selection analysis (Additional file 1: Table S2). Additional file 1: Table S2 shows a significant negative association between age at onset and the sum of QTR repeat lengths in both the REGISTRY and Predict-HD samples.
Notably (Additional file 1: Table S2), the effect size estimated in the REGISTRY sample using regression with selection (0.98 years earlier onset for each added hexamer) is similar to that observed in the Predict-HD sample, where the selection is less extreme (1.26 years earlier onset for each added hexamer). This indicates that applying regression with selection has successfully corrected for the bias in effect size induced by the extreme onset selection in the REGISTRY sample. The associations are slightly less significant when the pure hexamer repeat length is used rather than the full repeat: p = 6.5 × 10^-9 for linear regression (Additional file 1: Table S3). However, the sample size is relatively small, and a larger sample would be needed to establish whether there is any significant difference between these results.
Figure caption (fragment): Black numbers mark genotype counts. Red and blue numbers indicate mean residual ages at onset for individual genotypes, early onset in red, late onset in blue.
The sum of QTR lengths was found to predict residual age at onset significantly better than the difference in QTR lengths, the minimum or maximum QTR length, or the number of copies of the 3-repeat allele (Additional file 1: Table S4, Methods). QTR lengths are thus likely to influence age at onset in an additive manner.
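As an illustration of the five ways of coding the repeat that are compared above, the following Python sketch derives each statistic from a pair of allelic QTR lengths. The function name, the dictionary keys, and the value 35 for the three-repeat allele A2 (the 38-unit reference minus three hexamers) are assumptions made for this example and are not taken from the study's software.

```python
def repeat_statistics(allele_a, allele_b, three_repeat_len=35):
    """Summarise one genotype from its two allelic QTR lengths (hexamer units).

    three_repeat_len is an assumed QTR length for allele A2, i.e. the
    38-unit reference with a central (CAGGCC)3 instead of (CAGGCC)6.
    """
    longer = max(allele_a, allele_b)
    shorter = min(allele_a, allele_b)
    return {
        "sum": allele_a + allele_b,      # additive coding
        "diff": longer - shorter,        # difference between alleles
        "max": longer,                   # longer allele only
        "min": shorter,                  # shorter allele only
        "n_3rep": [allele_a, allele_b].count(three_repeat_len),
    }


# Example: a heterozygote carrying the reference allele and allele A2.
print(repeat_statistics(38, 35))
```

Under an additive model, only the "sum" entry is needed as the predictor; the other entries correspond to the alternative codings that were tested against it.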
The relationship of the association between residual age at onset and the sum of QTR repeat lengths with those of neighbouring SNVs is shown in Fig. 4 for the 468 individuals with both SNV and sequencing data. In these individuals, the significance of the association between residual age at onset and sum of repeat lengths (p = 1.2 × 10^-7) was greater than that observed with the most significant SNV, rs79727797 (p = 3.6 × 10^-5). To determine whether the sum of the QTR lengths or rs79727797 was driving the association with age at onset, we performed a conditional analysis in the 468 individuals with both SNV and sequencing data. When the association of rs79727797 with residual age at onset was conditioned on the sum of the QTR lengths, the association in our sequenced cohort was no longer significant (the p-value changed from 3.6 × 10^-5 to 0.83). However, when the association of age at onset with the sum of QTR lengths was conditioned on rs79727797 genotypes, it remained significant (p = 9.2 × 10^-4), indicating that the hexanucleotide QTR, and not rs79727797, is likely to be driving the signal in our data (Fig. 4). Manhattan plots of SNV associations with residual age at onset for the 468 individuals with SNV data, conditioning on the sum of QTR lengths and rs79727797 in turn, are shown in Additional file 1: Fig. S3.

Gene expression analyses
TCERG1 has significant cis-expression quantitative trait loci (eQTLs), which can be used in conjunction with GWAS data to predict gene expression [18] in several tissues: GTEx [19] whole blood, PsychENCODE [20] cortex, and eQTLGen whole blood [21]. rs79727797 is significantly associated only with expression of the nearby gene PPP2R2B (expansions in which cause SCA12) in eQTLGen (p = 1.13 × 10^-16), with the A allele that is associated with later [...] (Table S5). Likewise, the most significant eQTL SNVs for TCERG1 in eQTLGen are not associated with HD age at onset in GeM (Additional file 3: Table S6). Notably, rs79727797 is not significantly associated with TCERG1 expression (p = 0.45). This indicates that gene expression (at least in whole blood) is unlikely to be the mechanism through which TCERG1 influences age at onset in HD. This is corroborated by summary Mendelian Randomisation analyses using the eQTLGen expression data, which were non-significant (p = 0.974 for TCERG1, p = 0.07 for PPP2R2B). Co-localization analyses further showed that the eQTL and GWAS signals were different for both genes (colocalization probability = 0). The lack of overlap between the GeM GWAS association and the eQTLGen eQTLs for TCERG1 and PPP2R2B can be seen graphically in Additional file 1: Figs. S3 and S4.
We used FUSION [22] to perform TWAS analyses of the GeM dataset using the PsychENCODE [20] cortex expression data. There was a significant negative association between TCERG1 expression and age at onset (Z=-2.71, p=0.00671): increased TCERG1 expression is associated with earlier HD onset. Although the plot of eQTL and GWAS association (Additional file 1: Fig. S7) shows some overlap in signal, as does the table of significant eQTLs (Additional file 4: Table S7), a co-localization analysis does not show evidence that the eQTL and GWAS signals share the same causal variant (colocalization probability=0.0745). However, this analysis is inconclusive due to the relatively weak eQTL and GWAS signals (note that rs79727797 is not included in the analysis since the PsychENCODE sample is too small to demonstrate association with expression). No TWAS analyses were possible for PPP2R2B, since an insufficient proportion of variation in expression is attributable to SNVs. However, the plot of PsychENCODE eQTL and GWAS association (Additional file 1: Fig. S7) and table of significant eQTLs (Additional file 5: Table S8) show little overlap, which is supported by a colocalization analysis (colocalization probability=0.0376).

Discussion
TCERG1 is the only previously detected candidate gene for modifying HD age at onset to be confirmed by genome-wide association [8]. Our conditional analysis is consistent with the hexanucleotide tandem repeat in exon 4 explaining the signal attributed to the GWAS-significant SNV rs79727797 (which tags the three-repeat allele A2). The strength of the effect is directly proportional to the repeat length of the TCERG1 QTR, with shorter repeats associated with later onset and longer repeats with earlier onset of HD. The previous finding that a slightly earlier than expected age at onset was detected in individuals whose longest allele is one and a half hexanucleotide repeats longer than the reference [10] is consistent with our results (the participants with the genotype (38,40) in Fig. 2A most likely correspond to the inaccurately sized genotype (38,39.5) in [10]). The effect of the number of hexanucleotide repeats appears to be additive, with each additional repeat giving one year earlier onset of HD: the sum of repeats is significantly better associated with age at onset than either individual repeat allele or the difference between them (Additional file 1: Table S4). The previous study [10] did not find that fitting the combined length of the two alleles improved the significance of the association with age at onset, but it did not formally compare the various models for allele length.
That we were able to show a significant difference is likely due both to a larger sample, in which power was further increased by sampling individuals with extreme ages at onset, and to testing repeat lengths directly rather than allele lengths. Given the GWAS significant signal at this locus in an unselected HD population [8] we expect that this finding will replicate in unselected HD patients. Replication through sequencing the hexamer repeat in a larger unselected cohort is needed to assess the true effect size and the relationship of the modifier effect to repeat length.
TCERG1 has known functions in transcriptional elongation and splicing [12,13]. It is in the top 5% of genes most intolerant of missense mutations, suggesting an essential role in cell biology [14]. How the TCERG1 hexanucleotide repeat length modifies HD onset is unknown.
Possibilities include cis or trans modulation of TCERG1 or other gene expression, modulation of RNA splicing or transcription-splicing coupling, and effects on somatic expansion of the CAG repeat in HTT. Effects could be mediated by the tandem repeat in DNA or RNA, or by the translated (QA)n tract in protein. The QTR has a slightly stronger association signal than the central STR, which may reflect an association with the length of the QA repeat in the protein rather than the CAGGCC hexamer in the DNA but more work is required to substantiate this observation. In DNA, repeat loci can modulate gene expression in cis [23,24], while transcribed repeats in RNA, especially tri-and hexamer repeats, can alter splicing, associate with R-loops and alter RNA stability or binding [25]. The hexamer repeat in TCERG1 could act via altering expression of TCERG1 or the nearby gene PPP2R2B. In our analysis evidence for the involvement of TCERG1/PPP2R2B expression in modification of HD age at onset is unclear. It was not possible to test the association of the TCERG1 repeat with expression directly and the tagging SNV (rs79727797) is relatively rare (minor allele frequency = 2.4%), so requires a very large expression sample to show any association. Only eQTLGen (whole blood) is sufficiently large (n=31,684), and in this sample rs79727797 is significantly associated with PPP2R2B rather than TCERG1 expression. However, the summary Mendelian Randomisation analyses are not significant for either gene, suggesting that neither TCERG1 nor PPP2R2B expression is causally involved in modifying age at onset in HD, at least in blood. A significant TWAS association was observed in the PsychENCODE cortex expression data between reduced TCERG1 expression and later age at onset, although there was little evidence that the eQTL and GWAS signals were co-localized. However, rs79727797 was not part of the TWAS predictor, due to the insufficient size of the PsychENCODE eQTL dataset.
This weakened the GWAS signal, and thus reduced the power of the co-localization analysis.
Furthermore, it was impossible to perform TWAS or co-localization analyses in caudate or striatum due to the lack of suitable eQTL datasets (the GTEx caudate sample is too small to show eQTL association with TCERG1). Hodges et al. [26] did not observe significant differential expression of TCERG1 between HD patients and controls in caudate, although this study assessed expression via microarrays rather than more modern techniques. Langfelder et al. [27] observed significantly increased TCERG1 expression in the striata of Q111, Q140 and Q175 mice relative to wild type. However, this has been suggested to be a compensatory homeostatic response to promote neuron survival [28], and such an effect would be difficult to model in a human eQTL sample. Therefore, it is possible that reduced TCERG1 expression is associated with later onset of HD but corroborating evidence from other samples or direct experimentation is required for confirmation. Consistent with the observations of Langfelder et al. [27], immunostaining of post-mortem human brain showed increased nuclear TCERG1 in HD caudate and cortex compared with normal controls, and increased staining with HD grade, suggesting that there may be a localisation effect of the repeat as suggested previously [15] and that excess nuclear TCERG1 is deleterious in HD [10].
The hexanucleotide tandem repeat in TCERG1 encodes an imperfect (QA)n repeat in the protein and there are conflicting data on the role of this repeat in modulating normal TCERG1 function. One reporter assay found the QA repeat to be dispensable for TCERG1-mediated transcriptional repression [15], whereas a larger study in two cell lines found the QA repeat to be required for TCERG1-induced repression of the C/EBP transcription factor [29]. A minimum of 17 QA repeats was required for this activity. When the QA repeat was deleted, the QA-deleted TCERG1 colocalised with wild-type TCERG1 and prevented its canonical relocalisation from nuclear speckles to pericentromeric regions, implicating a possible dominant negative mode of action. This is consistent with the QA repeat being required to retain the nuclear localisation of TCERG1 [15], though not for its effect on transcription, although these overexpression experiments do not distinguish the effects of DNA, RNA and protein. A dominant negative mode of action would be inconsistent with the additive genetic effect we observe, although the effects we see relate only to differences of up to 5 units of the QA repeat in each TCERG1 allele, rather than a complete deletion of the QA tract. Effects of this smaller modulation in the QA repeat are therefore likely to be more subtle. Taken together with the evidence that increased nuclear localisation of TCERG1 is seen in HD mouse brain [27], it is plausible that the alteration in nuclear localisation conferred by the repeat could be responsible for the observed effect of TCERG1 on age at onset. It remains possible that TCERG1 expresses a novel function in cells with an expanded repeat unrelated to its normal function.
Many of the known genetic modifiers of age at onset of HD are proteins that act on DNA, particularly those involved in mismatch repair. These appear to operate by altering the levels of instability and expansion of the HTT CAG repeat, though there is also evidence for wider DNA repair deficits in HD [30,31]. It is possible that TCERG1 modifies HD onset by acting directly or indirectly on the mechanisms regulating somatic expansion. Expansions of the inherited HTT CAG length are most marked in non-dividing neurons, suggesting that these events take place during transcription or DNA repair. TCERG1 affects the processivity of RNA polymerase and splicing events during transcription, especially co-transcription [12,13].
During co-transcription it appears to bind and dissociate from stalled spliceosome complexes transiently [13] and the QA repeat might modulate this transient binding as it does with the C/EBP interaction [29]. HTT exon 1 contains an RNAPII pause site [32], associated with co-transcriptional splicing [33][34][35]. Pausing associated with co-transcriptional splicing of HTT could stabilise the DNA-RNA hybrid R-loops that occur during active transcription [36][37][38]. Stabilised R-loops would give opportunities for increased binding and processing by the DNA repair machinery, and promote somatic expansion of the CAG repeat in HTT exon 1. Pausing might also promote aberrant splicing of HTT exon 1, which is regulated by RNAPII transcription speed [39]. This would likely generate a vicious cycle as lengthening repeats lead to increased RNAPII pausing followed by further dysregulation of exon 1 splicing and production of toxic exon 1 HTT species [40]. Stabilised R-loops are also associated with increased levels of DNA breaks in CAG/CTG repeats cleaved by MutLγ, encoded by MLH1/MLH3, both associated with modulating the length of CAG and other expansions [41][42][43]: MLH1 is associated with altered age at onset of HD [44]. Of note, knockdown of TCERG1 in HEK293T cells leads to dysregulation of over 400 genes, including downregulation of MLH1 [12].
The role of TCERG1 in transcription could signal its involvement in the widespread transcriptional dysregulation that is seen in HD [12,26,45]. TCERG1 is involved in the assembly of small nuclear ribonucleoproteins in mRNA processing [46]. It also interacts with huntingtin [10]. In yeast, proteins containing a (QA)15 tract can bind to a fragment of mutant huntingtin containing 103 glutamines to suppress its toxicity [47]. In amyotrophic lateral sclerosis and some cases of frontotemporal dementia, TCERG1 increases the levels of TDP-43, the major constituent of the pathological hallmark inclusions in mammalian cells [48].
Notably, TDP-43 is observed alongside mHTT in extranuclear pathogenic inclusions in HD [49].
The genetic association of the CAGGCC/QA repeat in TCERG1 with age at onset of HD is robust, with a hint that it might operate at the level of the protein rather than DNA. More work is needed to clarify the mechanism by which it alters onset in HD and whether this is related to previously reported pathophysiologies or a new pathway. It provides a further potential treatment target in this incurable disease.

Conclusions
We have identified a variable hexanucleotide QTR in TCERG1 as a modifier of HD onset, with a one-year reduction in age at onset of HD for each additional hexamer repeat. Elucidation of the mechanism of its modifier effect will inform research into pathogenesis in HD and, potentially, other repeat expansion disorders, and could identify new therapeutic targets.

Subject details
We analysed genetic and phenotypic data of 506 patients with HD from the EHDN REGISTRY study (http://www.ehdn.org; [16]) and 104 individuals from the Predict study [50]. The REGISTRY set initially comprised 507 individuals, but one was excluded because the TCERG1 QTR could not be called reliably owing to low sequencing depth. Ethical approval for Registry was obtained in each [...] For the remaining 10 individuals, we used BioRep CAG lengths determined using Registry protocols (https://www.enroll-hd.org/enrollhd_documents/2016-10-R1/registry-protocol-3.0.pdf). For individuals from the Predict study, DNA was obtained from blood and we used the CAG length recorded in the study. SNV genotype data were available for 468 of the REGISTRY individuals, as part of the GeM GWAS [8].
Age at onset was assessed as described in [11]. For REGISTRY age at motor onset data, where onset was classified as motor or oculomotor by the rating clinician, the clinician's estimate of onset was used for onset estimation. For all other onset types, we used the clinical characteristics questionnaire for motor symptoms. Predict age at motor onset was as recorded in the study, determined using the age where the diagnostic confidence level = 4. The selection of the REGISTRY and Predict samples is described in detail in [11]. Briefly, the REGISTRY samples were selected for having extreme early or late onset compared to that predicted by their CAG length. The Predict-HD samples were selected based on extreme predicted early or late onset. These originally constituted 232 individuals, of whom we analysed only those 104 who had a known age at motor onset.

Calling tandem hexamer from whole exome sequencing (WES) data
For the Registry-HD cohort (N=506), sequencing was performed at Cardiff University [1]. [...] to be treated as substitutions. We then consider loci where some sequence reads have nucleotides that differ from the reference. We use these loci to retrieve the allele sequences by separating the reads into two groups such that all reads in a single group have the same nucleotides at these loci.
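The read-separation step can be sketched in Python as follows. This is a simplified illustration and not the UVC implementation: it assumes that every read has already been aligned to the reference and spans the whole locus (real exome reads cover only part of it), and the function name and toy sequences are hypothetical.

```python
from collections import defaultdict


def split_reads_by_variant_loci(reference, reads):
    """Group aligned reads by the bases they carry at variable loci.

    Assumes each read is aligned to, and the same length as, `reference`.
    A locus is 'variable' if at least one read differs from the reference there.
    """
    variant_loci = [
        i for i, ref_base in enumerate(reference)
        if any(read[i] != ref_base for read in reads)
    ]
    groups = defaultdict(list)
    for read in reads:
        # The key is the read's haplotype signature at the variable loci.
        key = "".join(read[i] for i in variant_loci)
        groups[key].append(read)
    return variant_loci, dict(groups)


# Toy example: two reads match the reference, two carry a C>T change.
reference = "CAGGCCCAGGCT"
reads = ["CAGGCCCAGGCT", "CAGGCCCAGGCT", "CAGGCTCAGGCT", "CAGGCTCAGGCT"]
loci, groups = split_reads_by_variant_loci(reference, reads)
print(loci, {key: len(members) for key, members in groups.items()})
```

For a diploid sample without sequencing errors, at most two groups are expected, one per allele; more than two groups would point to errors or misalignment that need separate handling.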

Sanger sequencing to confirm QTR sequences
To validate our tandem hexamer calls from WES data, we performed Sanger sequencing of four samples: two homozygous for the reference QTR allele (A1/A1 genotype), one heterozygous for a shorter QTR allele (A1/A2 genotype), and one heterozygous for a longer QTR allele (A1/A4 genotype). The QTR locus in TCERG1 was amplified by PCR using forward (5'-AACTGACACCTATGCTTG-3') and reverse (5'-GTTGAAGTGGATACTGCA-3') primers as described in the reference [10]. Amplicons were Sanger sequenced (LGC, Germany) in both directions using forward (5'-AACTGACACCTATGCTTGCAG-3') and reverse (5'-GAAGTGGATACTGCAGGTGC-3') primers, and sequences compared to their respective calls from short-read exome sequencing data. Sequences from Sanger and exome sequencing matched in each of the four cases.

Measuring TCERG1 QTR lengths using capillary electrophoresis
To confirm TCERG1 QTR lengths derived from exome-sequencing data, the QTR locus in TCERG1 was amplified by PCR using a fluorescently-labelled forward (5'-FAM-AACTGACACCTATGCTTG-3') and unlabelled reverse (5'-GTTGAAGTGGATACTGCA-3') primer before sizing by capillary electrophoresis (ABI 3730 genetic analyzer) and Genescan against a LIZ600 ladder of size standards (Thermofisher). In total we tested QTR length calls for 101 individuals from the Registry-HD sample: the 73 who had at least one non-reference QTR length allele (A2-A8) and 28 who were called as homozygous for the reference (A1) allele. The reference allele A1 was predicted to produce a PCR fragment of 307 bp. In all samples this allele was consistently sized at 299 bp by capillary electrophoresis. We attributed this to the repetitive nature of the sequence and the specific analyzer used. In all 101 individuals tested, allelic QTR lengths relative to the reference A1 allele QTR length exactly matched those called using exome-sequencing data.
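Because the reference allele was sized at 299 bp rather than the predicted 307 bp, fragment sizes are informative only relative to A1. A minimal Python sketch of that conversion is given below; the function name and the rounding to the nearest whole hexamer are assumptions for illustration, while the 38-unit reference length and the 299 bp observed size come from the text.

```python
HEXAMER_BP = 6             # one repeat unit is six base pairs
REFERENCE_UNITS = 38       # QTR length of reference allele A1
REFERENCE_SIZED_BP = 299   # size of A1 as observed by capillary electrophoresis


def qtr_units_from_fragment(sized_bp, reference_sized_bp=REFERENCE_SIZED_BP):
    """Convert an observed fragment size to a QTR length in hexamer units.

    The size is interpreted relative to the observed size of allele A1,
    so a constant sizing offset of the instrument cancels out.
    """
    offset_bp = sized_bp - reference_sized_bp
    return REFERENCE_UNITS + round(offset_bp / HEXAMER_BP)


# A fragment 18 bp shorter than A1 corresponds to three fewer hexamers.
print(qtr_units_from_fragment(281))
```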

Calculation of age at onset residuals
Expected ages at onset were calculated from patient CAG length data (measured as described above) using the Langbehn model [53]. Residual ages at motor onset were then calculated by subtracting the expected age at onset from the recorded clinical age at motor onset, as performed elsewhere [8].
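The calculation can be sketched in Python as below. The constants are those of the commonly quoted parametrisation of the Langbehn model for mean age at onset; they are not stated in this article and should be checked against [53] before use. The function names are hypothetical.

```python
import math


def expected_onset_langbehn(cag):
    """Mean age at onset predicted from CAG length.

    Uses the commonly quoted Langbehn parametrisation
    21.54 + exp(9.556 - 0.146 * CAG); constants assumed, see [53].
    """
    return 21.54 + math.exp(9.556 - 0.146 * cag)


def residual_onset(observed_onset, cag):
    """Residual = observed minus expected; positive means later than expected."""
    return observed_onset - expected_onset_langbehn(cag)


# Example: onset at 50 years with 42 CAG repeats.
print(round(residual_onset(50, 42), 1))
```

With this sign convention a positive residual denotes later onset than predicted from CAG length, which matches the definition of extreme late onset used in the association analyses.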

Association of age at onset with STR/QTR repeats
Linear regression of the age at onset residual on each repeat statistic (sum, difference, maximum, minimum, or number of copies of the three-repeat allele) was performed. Since the sample was selected to have extreme values (positive and negative) of this residual, linear regression is likely to overestimate the effect of the repeat on onset in the general HD patient population. Therefore, regression with selection (see Additional file 1: Supplementary Methods) was used to estimate the true effect size. A dichotomous phenotype was derived by selecting individuals with extreme late (positive residual greater than a pre-defined criterion) or early (negative residual less than a pre-defined criterion) onset. Association of the dichotomous phenotype with each repeat statistic was tested via logistic regression.
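The two phenotype codings can be illustrated with a short Python sketch using only the standard library. It fits a simple least-squares line and derives the dichotomous phenotype; the logistic regression itself is not reproduced here. The function names, the symmetric criterion and the toy data are assumptions for illustration.

```python
def ols_slope_intercept(x, y):
    """Ordinary least-squares fit of y on a single predictor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def dichotomise(residuals, criterion):
    """Label extreme late onset as 1, extreme early onset as 0, others as None.

    Assumes the same criterion (in years) on both sides of zero.
    """
    labels = []
    for r in residuals:
        if r > criterion:
            labels.append(1)
        elif r < -criterion:
            labels.append(0)
        else:
            labels.append(None)
    return labels


# Toy data: sum of QTR lengths and residual age at onset (years).
qtr_sum = [73, 76, 76, 78]
residual = [20.0, 5.0, -5.0, -18.0]
slope, intercept = ols_slope_intercept(qtr_sum, residual)
print(slope < 0, dichotomise(residual, 15))
```

Individuals labelled None fall between the two criteria and would be left out of the logistic regression.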
To formally test which repeat statistics best predict age at onset, we proceeded as follows: For each pair of statistics A and B, a linear regression of residual age-at-onset on statistic A was performed as a baseline. Then statistic B was added to the regression and the significance of the improvement in fit assessed using ANOVA. Statistics were defined as "best fitting" if the addition of no other statistic gave a significant improvement in fit.
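The pairwise comparison can be sketched as a nested-model F test. The Python example below uses only the standard library and returns the F statistic for adding statistic B to a model that already contains statistic A; converting it to a p-value requires the F distribution with 1 and n - 3 degrees of freedom, which is omitted here. All names and the toy data are hypothetical, and this is not the code used in the study.

```python
def fit_rss(columns, y):
    """Least-squares fit of y on an intercept plus predictor columns.

    Returns the residual sum of squares of the fitted model.
    """
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    # Augmented normal equations: (X'X | X'y).
    aug = [
        [sum(row[j] * row[k] for row in design) for k in range(p)]
        + [sum(row[j] * yi for row, yi in zip(design, y))]
        for j in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    beta = [aug[j][p] / aug[j][j] for j in range(p)]
    return sum(
        (yi - sum(b * v for b, v in zip(beta, row))) ** 2
        for row, yi in zip(design, y)
    )


def nested_f_statistic(stat_a, stat_b, y):
    """F statistic for the improvement in fit when B is added to a model with A."""
    rss_base = fit_rss([stat_a], y)
    rss_full = fit_rss([stat_a, stat_b], y)
    df_full = len(y) - 3  # intercept plus two slopes
    return (rss_base - rss_full) / (rss_full / df_full)


# Toy data in which y depends on both statistics.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
y = [5.1, 3.9, 9.05, 7.95, 13.0, 12.1]
print(nested_f_statistic(a, b, y) > 10)
```

A statistic would count as "best fitting" in the sense used above if no other statistic, added in this way, gives a significant F test.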

Analyses to test for correlation between genetically predicted expression and age at onset
FUSION [22] was used to perform TWAS analyses on the PsychENCODE data using precomputed predictors downloaded from http://resource.psychencode.org/. Summary Mendelian Randomisation was used to perform TWAS analyses on eQTLGen blood expression using cis-eQTL data downloaded from https://www.eqtlgen.org/cis-eqtls.html. Colocalisation analyses to test whether the eQTL and age at onset signals share the same causal SNV were performed using COLOC [54].

Supplementary Information
Additional file 1: Supplementary Information.
Table S1. Significance of the association between TCERG1 exon 4 quasi-tandem repeat (QTR) and residual age at onset ℛ for the various ways of coding the repeat.
Table S2. Significance of the association between the sum of TCERG1 QTR lengths and residual age at onset in REGISTRY, Predict-HD and combined samples.
Table S3. As Table S1, but for short tandem repeat (STR).
Table S4. Significance

Regression with selection.
Additional file 2:

Availability of data and materials
The datasets supporting the conclusions of this article are included within the article and its additional files. The software for regression with selection and for STR/QTR calling is available from https://github.com/LobanovSV in the RegressionWithSelection and UVC repositories, respectively.
Fig. S2. As Fig. 3, but for short tandem repeat (STR).

Supplementary Methods: Regression with selection
Section S1: Initial selection
We performed whole-exome sequencing of a small sub-group of the EHDN REGISTRY study. To increase statistical power, we selected individuals with the largest absolute value of the residual age at onset ℛ.
The probability density function P(ℛ) of the residual ages at onset in the selected sample is that of a normal distribution with mean 0 and standard deviation σ, restricted by the selection function S(ℛ) and renormalised. Here, σ is the standard deviation of the initial HD population (EHDN REGISTRY study), ℛ_thr is the selection threshold, and ∆ is the selection width, which was infinitely small, ∆ → 0.
The expected probability density of the initial HD population ℕ_σ(ℛ), the selection function S(ℛ), and the expected probability density P(ℛ) of the selected HD sub-group are shown in Fig. S9.
Fig. S9. a Expected probability density of the initial HD population (normal distribution); b Selection function with infinitely small ∆; c Expected probability density of the small HD sub-group with largest absolute value of the residual age at onset |ℛ|.

Section S2: Correction of the age at onset residuals
To improve the accuracy of the correction of age at onset for CAG length, we additionally measured the length of the uninterrupted HTT exon 1 CAG repeat using an Illumina MiSeq platform for 496 individuals from our HD cohort and corrected the HD age at onset residuals. Some individuals who had an age at onset residual above the threshold ℛ_thr shifted to the region with |ℛ| below the threshold ℛ_thr after correction. Conversely, some individuals who would have had a corrected age at onset residual above the threshold ℛ_thr were not selected because their uncorrected |ℛ| was below the threshold ℛ_thr. The correction has therefore widened the selection function, corresponding to a non-zero selection width ∆.
The probability density function P(ℛ) of our HD group with corrected residuals can be modelled in the same way as described in Section S1, but with finite selection width ∆.
The observed cumulative probability of |ℛ| and the expected one with optimal parameters σ = 7.02, ℛ_thr = 17.6, ∆ = 3.30 are shown in Fig. S10A. The one-sample Kolmogorov-Smirnov p-value is 0.81. The selection and probability densities are shown in Fig. S10B,C. [...] where ℕ_σ(ℛ) is the probability density function of the normal distribution (see Section S1). Here, σ, β0, and β1 are the unknown standard deviation, intercept, and effect size, respectively; ℛ and L are the age at onset residual and the sum of TCERG1 QTR lengths of a specific individual.
Note that the integral in the denominator depends on the individual's sum of QTR lengths L. Here, S(ℛ) is the selection function (see Section S1).
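The structure of such a likelihood can be sketched in Python for the simplest case of a hard cut-off (selection width ∆ → 0), in which an individual is included only if the absolute residual exceeds the threshold. This is an illustration under stated assumptions and not the RegressionWithSelection implementation: the residual is assumed to be normally distributed around intercept + effect × (sum of QTR lengths), the finite-width selection function fitted in Section S2 is not modelled, and all function and parameter names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def selection_probability(mean, sd, threshold):
    """P(|R| > threshold) for R ~ Normal(mean, sd).

    This plays the role of the denominator integral for a hard cut-off.
    """
    inside = normal_cdf(threshold, mean, sd) - normal_cdf(-threshold, mean, sd)
    return 1.0 - inside


def log_likelihood(residuals, qtr_sums, intercept, effect, sd, threshold):
    """Log-likelihood of selected individuals under the assumed model."""
    total = 0.0
    for r, s in zip(residuals, qtr_sums):
        mean = intercept + effect * s  # depends on the individual's QTR sum
        total += math.log(normal_pdf(r, mean, sd))
        total -= math.log(selection_probability(mean, sd, threshold))
    return total


# Toy evaluation with an effect of -1 year per hexamer, centred on a sum of 76.
print(log_likelihood([20.0, -18.0], [73, 78], 76.0, -1.0, 7.0, 17.6))
```

Because the normalising term is recomputed for each individual, maximising this function over the intercept, effect size and standard deviation yields an effect estimate that accounts for the extreme-onset selection, which ordinary linear regression does not.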