Genomic footprints of activated telomere maintenance mechanisms in cancer

Unlimited replicative potential is a hallmark of cancer. It requires the maintenance of telomere repeats either by telomerase activation or by an alternative lengthening of telomeres (ALT) pathway. Here, we dissected whole-genome sequencing data of 2,519 matched tumor-control samples from 36 different tumor types to characterize the genomic footprints of these telomere maintenance mechanisms. While the telomere content of tumors with ALT-associated mutations was increased, tumors with putative telomerase activation showed a moderate decrease of telomere content. A systematic search discovered 2,683 somatic integrations of telomeric sequences into non-telomeric DNA. These telomere insertions were distributed across the genome and strongly correlated with increased telomere content, genomic breakpoint frequency and ALT-associated mutations. Moreover, ALT-associated mutations were significantly linked to telomere variant repeats, especially if embedded in canonical TTAGGG telomere repeats. This includes a previously undescribed accumulation of TTCGGG. Overall, our findings provide new insight into the recurrent genomic alterations that are associated with the establishment of different telomere maintenance mechanisms in tumors.

DNA is synthesized onto the chromosome ends by telomerase, an enzyme that is composed of the reverse transcriptase TERT and the RNA template TERC and is active in the germline and stem cells, but absent in most somatic cells 6 . Telomerase is up-regulated in about 85% of human cancers by different mechanisms, including TERT amplifications 7 , rearrangements [8][9][10] or mutations in the TERT promoter 11,12 . The remaining tumors employ an alternative lengthening of telomeres (ALT) pathway, which is based on DNA recombination of telomeric sequences 13 . Details on the ALT mechanism remain elusive but it has been associated with loss-of-function mutations in the chromatin remodeling genes ATRX (a-thalassaemia/mental retardation syndrome X-linked) and DAXX (death-domain associated protein) 14 . Telomeres of ALT cells characteristically have heterogeneous lengths and contain a range of telomere variant repeats (TVRs) [15][16][17] . Other hallmarks of ALT include ALT-associated promyelocytic leukemia (PML) nuclear bodies (APBs) and abundance of extrachromosomal telomeric repeats of various forms (such as C-circles) 13 .
While normally located at the chromosome termini, telomere sequences are also found in intrachromosomal regions. As such, interstitial telomeric sequences with large blocks of telomere repeats exist in humans and other species, which probably arose from ancestral genome rearrangements or other evolutionary events 18 . Recently, ALT-specific, targeted telomere insertions into chromosomes that lead to genomic instability have also been described 19 . Another source for unexpected telomere repeat sites is the stabilizing function of telomeres at broken chromosomes. After a double-strand break, telomeres can be added de novo to the unprotected break sites ("telomere healing") 20,21 or acquired from other chromosomal positions ("telomere capture") 22,23 .
Here, we characterized the telomere landscape of 2,519 tumor samples from 36 different tumor types using whole genome sequencing data. After searching for ALT-associated and putative telomeraseactivating mutations, we focused on differences between these two sample subsets in terms of telomere content, tumor-specific telomere insertions and accumulation of TVRs. Our findings show significant increases in telomere content and number of telomere insertions in tumor samples with ALT-associated mutations. We further show that singleton TVRs in TTAGGG stretches are particularly informative about possible underlying telomere maintenance mechanisms (TMMs) and reveal previously uncharacterized TVRs specifically enriched in tumors with different TMM-associated mutations.

Telomere content across cohorts
Due to the repetitive nature of telomere sequences, short sequencing reads from telomeres cannot be uniquely aligned to individual chromosomes. However, a mean telomere content can be estimated from the number of reads containing telomere sequences 17,24-27 . Here, we extracted reads containing at least six telomere repeats per 100 bases, allowing the canonical telomere repeat TTAGGG and the three most common TVRs TCAGGG, TGAGGG and TTGGGG. The telomere content was defined as the number of unaligned telomere reads normalized by sequencing coverage and GC-content. In this way, the telomere content of 2,519 tumor samples and matched controls from 36 different tumor types was determined.
Telomere content of the controls anti-correlated with age (r = -0.36, Spearman correlation) ( Supplementary Fig. 1a). The strong correlation (r = 0.47, Spearman correlation) between the telomere content of the tumor and control samples is partially explained by this age effect ( Supplementary   Fig. 1b). We normalized for the correlation by computing the ratio of tumor to control telomere content.
Most tumor samples had a lower telomere content than the matched control (Fig. 1a). However, there were systematic differences between the different tumor types. Among those with the highest telomere content increase were osteosarcomas and leiomyosarcomas (median telomere content tumor/control log2 ratios = 0.7 and 0.3, respectively). A particularly low telomere content was found in colorectal adenocarcinoma and medulloblastoma (median telomere content tumor/control log2 ratios = -1.0).

Prevalence of TMM-associated mutations
Different types of mutations in ATRX or DAXX and around TERT have been associated with ALT and telomerase activation. We therefore searched for these types of somatic mutations to infer the active TMM. In total, 74 tumor samples had truncating ATRX (n = 63) or DAXX alterations (n = 11) and are referred to as ATRX/DAXX trunc in the following. Of note, 10  The structural variations upstream of TERT were strikingly focal ( Supplementary Fig. 2), suggesting an advantage of tumors with rearrangements near the TERT transcription start site (TSS). Moreover, 40% (n = 25/62) of the juxtaposed positions within 20 kb upstream of the TERT TSS overlapped directly with enhancers from the dbSUPER database 28 . In contrast, only 13% (n = 9/69) of the juxtaposed positions between 20 and 1,000 kb corresponded to a predicted super-enhancer. These results point to "enhancer hijacking" near the TERT TSS, a phenomenon which has been described in neuroblastoma 8,9 and for which indications have recently been found in further cancer types 29 .
The telomere content in TERT mod samples differed significantly from that in ATRX/DAXX trunc samples (p = 2.4 × 10 -9 , Wilcoxon rank-sum test; Fig. 1c, a detailed overview is shown in Supplementary Fig. 3a Wilcoxon rank-sum test; Fig. 1d, a detailed overview is shown in Supplementary Fig. 3b).

Telomere insertions occur frequently in tumors with ALT-associated mutations
To find insertions of telomeres into non-telomeric regions of the genome, we searched for tumor-specific discordant paired-end reads where one end maps to the chromosome and the other end is telomeric.
Exact positions of the insertions were determined from reads spanning the junction site and visually inspected (examples in Fig. 2 and Supplementary Fig. 4).
Overall, 2,683 telomere insertions were detected. These were distributed unevenly between samples and different tumor types (Fig. 3a). While no telomere insertions were found in 73% of the tumor samples, the remaining samples had between one and 228 telomere insertion events. The tumor types with the highest amount of telomere insertions per tumor sample were leiomyosarcoma and osteosarcoma, both of which also had a relatively high mean telomere content. In fact, the number of telomere insertions positively correlated with the telomere content (r = 0.19, Spearman correlation).
Moreover, the number of telomere insertions was associated with the number of genomic break points in the sample (r = 0.38, Spearman correlation). To test for a synergistic effect, linear models that predict telomere insertions from telomere content and breakpoint abundance with and without an interaction term were computed. The models with the interaction term (p = 8.8 × 10 -234 ) performed substantially better than purely additive models (p = 5.8 × 10 -90 ).
There was clearly a higher percentage of samples with telomere insertions in ATRX/DAXX trunc tumors (70%) than TERT mod tumors (29%) (Fig. 3b). As expected, ATRX/DAXX trunc samples also had a higher number of breakpoints (mean = 722) than TERT mod samples (mean = 298) (Fig. 3c) Table 1). The exceptions are PLCB2 and ABCC8, whose homologues have so far only been reported in association with telomere length regulation in yeast 33,34 .
The detected telomere insertions were scattered across different chromosomes and regions within the chromosome ( Supplementary Fig. 5). No clear preferential insertion sites were identified, but several de novo telomere junctions occurred at the chromosome ends (3% within 50 kb of the first or last chromosomal segment). A total of 44% of the telomere insertions were in genes, and 8% of these disrupted exons. Several tumor suppressor genes were affected, e.g. CHEK1 encoding for a protein involved in cell cycle arrest upon DNA damage 35 (Fig. 2). This indicates that telomere insertions may directly contribute to tumor development.
Of note, patterns of microhomology were observed in 79% of telomere insertions with t-type repeats at the junction site ( Supplementary Fig. 6).

Telomere insertions often coincide with loss of the adjacent chromosomal segment
Most of the telomere insertions were one-sided (98%), i.e. telomere sequences were only attached to one side of the breakpoint (example in Fig. 2). Telomere insertions were defined as two-sided if there was a second telomere insertion event downstream in the opposite orientation (example in Supplementary Fig. 4). Because so many breakpoints were one-sided, we investigated the fate of the corresponding broken fragment using complementary information from copy number changes and structural variation annotation. As expected, one-sided telomere insertions coincided most frequently with copy number loss of the adjacent segment (46%) (Fig. 3e). In contrast, copy number gains of the fragment were rare (6%). Surprisingly, telomere insertions were frequently located at copy number neutral sites (43%). Overlaps with regions of chromothripsis were found for 24% and structural variations without chromothripsis overlap (including telomere insertions) were detected near the insertion site for 28% of the copy-number neutral cases.

Singleton TVRs are enriched in ATRX/DAXX trunc samples
It has previously been shown that ALT leads to an increased integration of TVRs into telomeres, the most common ones being hexamers of the type NNNGGG 17 . To detect differences in the telomere composition of ATRX/DAXX trunc and TERT mod tumors, we therefore searched for NNNGGG repeats in telomere reads. The most frequent TVRs across all tumor samples were TGAGGG, TCAGGG and TTGGGG ( Supplementary Fig. 7), which are known to be enriched in proximal telomeric regions 2,3 .
These and the seven other most frequent TVRs (TAAGGG, GTAGGG, CATGGG, TTCGGG, CTAGGG, TTTGGG and ATAGGG) were chosen to search for common telomere repeat combinations. For this, the neighboring 18 base pairs on either side of the TVRs were determined (Supplementary Table 2).
Most TVRs were surrounded by many different pattern combinations (e.g. TTGGGG). Others were dominated by a certain repeat context, which was similar in ATRX/DAXX trunc and TERT mod tumors (e.g. CATGGG or ATAGGG). However, TTCGGG stood out, as 33% of the TVRs in ATRX/DAXX trunc samples were surrounded by canonical t-type repeats, whereas this context was observed in only 2% of TTCGGG TVRs in TERT mod tumors.
Comparison of variant hexamers surrounded by t-type repeats ("singletons") to TVRs in an arbitrary sequence context further revealed that the former are generally better suited to distinguish ALT-positive from telomerase-positive samples ( Supplementary Fig. 8). The remaining variant analysis therefore focused on singleton repeats. CATGGG was excluded as it did not occur as singleton repeats. For the other TVRs, the median of absolute counts varied between 13 and 112, but counts in individual tumor samples reached more than ten thousand ( Supplementary Fig. 9).

ALT prediction
ALT has several different hallmarks with which it can be reliably identified 13  The variables with the highest importance for the classification were the number of breakpoints and the divergence of observed TTCGGG and TTTGGG singleton TVRs to the expected count (Supplementary Table 5). The scores resulting from the classifier can be interpreted as an ALT probability. As expected, ATRX/DAXX trunc had a high ALT probability (mean = 0.90), while TERT mod samples had a low ALT probability (mean = 0.15, Supplementary Fig. 14). A total of 15 samples without ATRX/DAXX trunc mutations had a ALT probability of over 0.9, of which two had non-truncating ATRX/DAXX mutations and one sample had a truncating mutation in ATRX and a TERT amplification. Across the entire dataset, most samples had a low ALT probability (Fig. 5), suggesting that their TMM is telomerase-based. This included some samples with ATRX/DAXX missense mutations, suggesting that the mutations in those samples may be more of a passenger event than functionally relevant. Tumor types with a high ALTprobability were leiomyosarcoma, osteosarcoma and pancreatic endocrine tumors, in keeping with the known high prevalence of ALT in these entities 36,37 .

Discussion
In this study, we have shown that the presence of ALT-associated mutations in tumors correlates with increased telomere content, enrichment of isolated TVRs in t-type context (singletons), a higher number of genomic breakpoints and intrachromosomal telomere insertions (Fig. 6). In contrast, tumors with mutations associated with a possible telomerase-activation showed moderate decrease of telomere content and increased TERT expression. The observed telomere content increase in ALT-associated versus the decrease in telomerase-associated samples is in agreement with the recent findings of Barthel et al. 29 . The higher telomere content in ALT-positive tumors could indicate that the negative feedback loop that constrains telomere elongation to a physiological level in healthy telomeraseexpressing cells 38,39 is bypassed by the ALT process, while it seems to remain intact in telomerasepositive tumors. In contrast, ALT-positive tumors are subject to a less regulated process, which circumvents negative feedback regulation. In addition to telomere elongation, the increase of telomere content in ALT-positive tumors detected by sequencing-based methods may partly stem from aberrant intra-chromosomal telomere insertions 19 or extra-chromosomal telomeric DNA 40 .
In our study, we systematically mapped telomere insertions into non-telomeric genomic regions using whole-genome sequencing data. They were most frequently accompanied by a loss of the adjacent chromosomal segment or located at copy number neutral sites. Surprisingly, the latter telomere Telomere read extraction and computational telomere content estimation.
The telomere content of WGS samples was determined using the software tool TelomereHunter (www.dkfz.de/en/applied-bioinformatics/telomerehunter/telomerehunter.html) 45 . In short, telomeric reads containing six non-consecutive instances of the four most common telomeric repeat types (TTAGGG, TCAGGG, TGAGGG and TTGGGG) were extracted. For the further analysis, only unmapped reads or reads with a very low alignment confidence (mapping quality lower than 8) were considered. The telomere content was determined by normalizing the telomere read count by all reads in the sample with a GC-content of 48-52%.

Determining TMM-associated mutations.
Samples with a truncating ATRX or DAXX alteration (frame-shift insertion/deletion, stop-codon gain or structural variation breakpoint within the gene) were defined as ATRX/DAXX trunc , samples with other simple nucleotide variants were defined as ATRX/DAXX non-trunc . Samples with a structural variation breakpoint on the plus strand 20 kb upstream of TERT the transcription start site were defined as TERTp SV . TERT amp samples had at least six additional copies of the TERT gene compared to the mean ploidy of the sample. Tumor samples with a C228T or C250T TERT promoter mutation were defined as TERTp mut . Due to the low sequencing coverage at the TERT promoter, these mutations were called using less stringent criteria (at least two reads with the mutated base, mutational frequency of at least 20%). If multiple of these TERT modifications were present, the sample was defined as TERT mult .
Samples with these TERT alterations were summarized as TERT mod . Samples without any of these alterations were defined as "Other". If a sample had both a TERT mod alteration and an ATRX/DAXX alteration, it was defined as "Mixed" for detailed analysis or included in "Other" for summary analysis.

Overlap of juxtaposed positions upstream of TERT and predicted super-enhancers.
For the closest structural variation (SV) of each tumor sample to the TERT TSS, the juxtaposed genomic coordinates were compared to 65,950 predicted super-enhancers from the dbSUPER database 28 . Only SVs on the plus strand and within 1 mb of the TERT TSS were considered. Overlaps of juxtaposed positions with super-enhancer sites were defined as direct overlaps. Super-enhancer sites within 1 mb of the juxtaposed position were defined as indirect overlaps.

Telomere insertion detection.
To find insertions of telomeric sequences into non-telomeric regions in the genome, we searched for tumor-specific discordant paired-end reads, where one end was an extracted telomere read and the other end was non-telomeric and uniquely mapped to a chromosome (mapping quality > 30). In 1 kb

Chromothripsis detection.
To identify chromothripsis events, we extended the set of statistical criteria proposed by Korbel and Campbell 48 . The basic idea is to determine whether there is a statistically significant number of interleaved structural variants (SVs) in a contiguous genomic region. We did this by constructing a graph whose nodes correspond to SVs and whose edges connect interleaved SVs. The identified clusters of SVs were also tested for the presence of alternating copy number and loss-of-heterozygosity patterns.
The resulting chromothripsis calls were validated visually. The full description of the methodology and the detailed patterns of chromothripsis events in the genomes will be reported in a separate manuscript

Copy number changes at telomere insertion sites.
Copy numbers of chromosomal segments were obtained from the PCAWG consensus calls (REF when available). Copy numbers reveal gains or losses of chromosomal segments based on coverage and Ballele frequency, but were here limited to segments of at least 10 kb. The breakpoint estimations could differ from the actual site by up to 50 kb. Therefore, telomere insertions were assigned to the closest breakpoint within 50 kb. If there was no breakpoint within 50 kb or the copy numbers at either side of the telomere insertion were the same, the copy number change at the telomere insertion was defined as neutral.

Structural variations near telomere insertion sites.
Structural variation annotation was obtained from the PCAWG consensus calls (REF when available), which was based on discordant mate pairs and split reads, providing exact breakpoints. Because copy number variations smaller than 10 kb were not detected by copy number callers, small deletions next to the telomere insertion site may be missed. We therefore searched for structural variations within 10 kb of a telomere insertion to detect these cases.

Candidate gene selection for correlation analysis.
A list of 1,725 telomere maintenance associated human genes was obtained from TelNet

Detection of telomere repeat variants.
Telomere repeat variants (TVRs) were detected by searching for hexamers of the type NNNGGG in the extracted telomere reads. Each base was required to have a base quality of at least 20. The neighboring 18 bp on either side of the TVR were determined. For further analysis, NNNGGG TVRs were once computed for arbitrary context and once for t-type context ((TTAGGG) 3 -NNNGGG-(TTAGGG) 3 , also called "singletons"). The absolute counts were normalized to the total number of reads in the sample.
The expected pattern counts at different telomere content tumor/control log2 ratios were taken from the regression line through telomerase-positive samples.

Principal component analysis (PCA).
Eight features were selected for the PCA: telomere content tumor/control log2 ratio, number of telomere insertions, number of break points and the distance of TGAGGG, TCAGGG, TTGGGG, TTCGGG and TTTGGG singleton repeats (i.e. repeats in a t-type context) to their expected occurrence. The PCA was performed on the selected features of the ATRX/DAXX trunc and TERT mod tumor samples. Three samples were excluded due to missing data, leaving 327 samples for the analysis.

Classifier for predicting active telomere maintenance mechanisms.
A random forest classifier to distinguish ATRX/DAXX trunc and TERT mod samples was built using the R packages "randomForest" 50 and "caret" 51 with the same features as for the PCA. To deal with the imbalance in the data set (i.e. 254 TERT mod samples vs. 73 ATRX/DAXX trunc samples without missing data), the model was trained with a down-sampled training set. The performance was determined using 10-fold cross-validation.

Statistics.
Differences between ATRX/DAXX trunc and TERT mod samples in terms of telomere content, percent break points with telomere insertions and singleton repeat abundance were tested using two-sided Wilcoxon rank-sum tests. Singleton repeat abundance p-values were corrected for multiple testing using the Bonferroni method. To reduce the influence of outliers, correlation coefficients were calculated with the Spearman method. Correlation between control telomere content and age as well as tumor and control telomere content was tested with linear regression. All statistical analyses were carried out using R (R Foundation for Statistical Computing).    0  0  5  5  10  10  15  15  20  20  25  25  30  30  35  35  40  40   G A C T CC T G AA T C T T T TG T TGG C A T T TG C T TC T G T T TG CC T AAG C A C A T T TG T AAA T G T C T AAG C T TC AAA T TAAG AAG C T A T G T GG T TG C T A CC T G A T TA T A T T TA G T C AA T AAAC T TA C T T TC A T A T T TG T T T   (e) Copy number changes of adjacent segments accompanying telomere insertions. "Complex" means that the copy numbers between segments differ in more than four copies. Overlaps with regions of chromothripsis are indicated. For telomere insertions that did not overlap with regions of chromothripsis, structural variations or additional telomere insertions within 10 kb are indicated. h2h = head-to-head, t2t = tail-to-tail.   It is known that telomeres elongated by telomerase have a homologous length with few TVRs in distal telomeric regions (left), while ALT telomeres have heterogeneous lengths with an increased amount of TVRs (right). Moreover, ALT cells have abundant extrachromosomal telomeric sequences. From this study, we conclude that the chromosomes of ALT cells have a higher number of aberrant interstitial telomere insertions, most of which are one-sided and accompanied by a loss of the adjacent chromosomal segment. We also showed that several TVRs occurring as singletons are more abundant in ALT telomeres, while one singleton (TTTGGG) was more abundant in telomerase-elongated telomeres. Please note that it is currently undetermined whether the different types of singletons are located in proximal or distal telomeric regions.   The number of homologous bases between the canonical TTAGGG telomere repeat and the human reference genome at telomere insertions is shown on the x-axis. The number of telomere insertions with a pattern of TTAGGG microhomology (red), blunt-end DNA joining (yellow) or without TTAGGG repeats at the junction site (green) are shown on the y-axis.    TGAGGG  TCAGGG  TTGGGG  TAAGGG  GTAGGG  CATGGG  TTCGGG  CTAGGG  TTTGGG