Ongoing human chromosome end extension driven by a primate ancestral genomic region revealed by analysis of BioNano genomics data

The majority of human chromosome ends remain incomplete due to their highly repetitive structure. In this study, we use BioNano data to anchor and extend chromosome ends from two European trios as well as two unrelated Asian genomes. Two thirds of BioNano assembled chromosome ends are structurally divergent from the reference genome, including both deletions and extensions. The majority of extensions are homologous to sequences on chromosome 1p, 5q and 19p. These extensions are heritable and in some cases divergent between Asian and European samples. We identified two sequence families in these sequences which have undergone substantial duplication in multiple primate lineages, leading to the formation of new fusion genes. We show that these sequence families have arisen from progenitor interstitial sequence on the ancestral primate chromosome 7. Comparison of chromosome end sequences from 15 species revealed that chromosome end divergence matches the corresponding phylogenetic relationship and revealed a rate of chromosome extension since the primate divergence of 80-440 kbp per million years.


Introduction
The genome sequence of chromosome ends in the reference human genome assembly are still incomplete. In the latest draft of the 2 human genome 1 21 out of 48 chromosome ends were incomplete; amongst which five chromosome ends (13p,14p,15p,21p,22p) 3 are completely unknown and the remaining chromosome ends are capped with 10kb-110kb of unknown sequence. There are 4 many interesting observations in the chromosome end regions which remain unexplained, such as the observed increase in 5 genetic divergence between Chimpanzee and Humans towards the chromosome ends 2 . 6 Chromosome ends contain telomere sequences and subtelomeric regions. Most human chromosome subtelomeric regions 7 are duplications of other chromosome subtelomeric regions arranged in different combinations, referred to as subtelomeric 8 duplications(STD). STD are highly divergent between species or even different populations of the same species 3, 4 and 9 have experienced rapid adaptive selection 3 . Subtelomere length polymorphism is also found in humans 5 . The majority of 10 subtelomeric duplications have the same orientation towards the chromosome end 3,4 . Based on this it has been suggested that 11 they originated from reciprocal translocation of chromosome tips and unbalanced selection 4 . 12 Telomere repeat sequences ([TAAGGG] n ) -which are the capping sequences of chromosome ends -are breakable, acquirable 13 and fusible in the genome. In somatic cells, telomeres are observed to progressively shorten 6,7 . If the telomere sequence is lost, 14 the broken chromosome will become unstable [8][9][10] , and multiple types of rearrangements can occur, including chromosome 15 fusion 8 , tips translocation 11 , or direct addition of telomere repeats 10 . The manual insertion of telomere sequence in interstitial 16 region results in enhanced chromosome breakages and induces high rates of chromosome rearrangements around the insertion 12 . 17 Interstitial telomeric sequences (ITS) are widespread in the genome 13,14 . In subtelomeric regions, their orientations are almost 18 always towards the terminal end of the chromosome, like the STD 4 . 19 The quality of assembly for the chromosome ends largely depends on the sequencing technology. To correctly assemble 20 highly duplicated regions like chromosome ends, sequence reads or read pairs spanning the repeat are required 15 . Recently, 21 the NanoChannel Array (Irys System) from BioNano Genomics 16 was introduced. This technology can generate barcodes on 22 DNA fragments which are hundreds of kilobases long by detecting the distance between specific enzyme recognition sites. 23 Alignment and genome assembly are performed based on numerous distinct site distance fragments. These very long fragments 24 enable construction of individual physical maps, as well as completing the reference in unknown regions 17 . In this manuscript, 25 we report on observed subtelomeric dynamics using this technology for the first time. We conclude that these genome dynamics 26 reflect ongoing chromosome extension and deletion and identify genomic regions which have undergone substantial extension 27 in multiple primate lineages. 28

29
Assembling chromosome ends using BioNano data supports updates to GRCh38 reference 30 We downloaded data from BioNano Genomics(see Methods). The data contains eight samples, including two family trios 31 (Ashkenazi, CEPH) and two Chinese samples. The raw data has been assembled into contigs (with N50 of 3.3MB) and aligned 32 to the GRCh37 reference. We used the most distal unique aligned sequences at the chromosome ends to anchor individual 33 chromosome ends (see Methods). Overall, 19 reference chromosome ends were extended from the GRCh37 reference ( Figure 34 1, Supplementary Figure 1). The extension sequences supported a re-orientation of chromosome 2q ends in the new GRCh38 35 reference (Supplementary Figure 2a). We observed extension sequence at 9q in one family which matched the GRCh38 36 extension of this chromosome(Supplementary Figure 2b). We also observed a 260kb extension sequence in one sample at 16p 37 terminal which was originally discovered in 1991 5 (Supplementary Figure 1.31).

38
Discovery of heritable chromosome end polymorphism 39 We revealed characteristics of chromosome ends by analysis of completed chromosome ends of the eight samples. We focused 40 on autosomes in this study. Five unknown heterochromatic acrocentric chromosome short-arms (13p, 14p, 15p, 21p and 22p) 41 were excluded from analysis. The chromosome 1p end was also excluded because of the discontinuous alignment of the 42 BioNano contigs at a reference gap(Supplementary Figure 1.1). For the remaining 38 chromosome ends, we compared them to 43 the corresponding GRCh37 chromosome ends (Method, Figure 1, Supplementary Figure 1).

44
Chromosome end polymorphism was frequently observed versus the GRCh37 reference ( Fig.1 and Supp. Fig. 1 (unaligned,1p,19p,16q) are observed in 9q arm amongst 8 samples (Figure 1a). Chromosome end 50 polymorphism are widespread amongst all samples studied (Supp. Fig. 1). We observed that the daughter chromosome end 51 sequences (both deletion and extension) are almost always (93.4%, 71/76) also observed in one of her parents, indicating that 52 the chromosome end data are unlikely to be a result of an assembly artefact.

53
Inferring the origin of extension sequences 54 We recognized the origin of the extension sequences by re-aligning the BioNano labels (enzyme recognition site) patterns 55 to GRCh37 reference. Most of the extension sequences were unaligned due to relatively short size and the 10kb density of 56 enzyme recognition sites (average length of unaligned and aligned contigs was 53k and 97k respectively). All the aligned 57 extension sequences are aligned to one of 1p, 3q, 5q, 16q or 19p. 58 We identified (see Methods) two sub-telomeric duplication families (family A and B of length 9kb and 8kb respectively) 59 which both have 16 copies in the human genome. Family A is observed in all of the 1p, 3q, 5q, 16q, 19p extension sequence 60 regions; whereas family B is only observed in 1p,3q 16q and 19p. We identified all homologous sequences to these two families 61 in the non-redundant nucleotide database which we used to construct a phylogenetic tree for each family(see Methods, Figure 62 2a,2b, Supplementary table 3,4). These duplications were only found in close species and were highly duplicated in human 63 (A:16,B:16 copies) and chimpanzee(A:9,B:7 copies). Both trees are clustered into 3 parts: namely i. interstitial duplication 64 across multiple species, ii. human subtelomeric duplication and iii. chimpanzee subtelomeric duplication. The interstitial 65 duplications are clustered between species on a longer branch, suggesting they form an orthologous group. We conclude that 66 the subtelomeric cluster duplicated from an ancient interstitial sequence in the genome millions years ago (see Discussion). 67 The human and chimpanzee subtelomeric copies do not cluster together, suggesting these subtelomeric copies are recently and 68 independently duplicated in each species.

69
Identification of the ancient shortening chromosome end telomeric sequence 70 There are multiple interstitial telomeric sequences (ITS) in the human genome which are orientated in the same direction 71 at subtelomeric regions 4 . For example, 6 telomere sequences are in the first 110kb of 18p ( Figure 3a). The reciprocal tips 72 translocation model 4 for subtelomeric duplication does not involve the chromosome terminus, nor create new telomere repeat 73 sequence, and thus we are unaware of any mechanism for explaining these sequences. We investigated the relationship between 74 all chromosome end ITS and duplications of 1kb or more (Supplementary Figure 3, Supplementary Table 5, Method). We 75 found that all ITS are either fully contained within a duplication, or within 50bp of a duplication (Figure 3d). The vast majority 76 overlap or are next to a duplication on the distal side of the ITS (16 sites) rather than proximal (3 sites, of which each site also 77 has a distal-side duplication overlap, see Supplementary Table 5). The dominance of distal side ITS duplication suggests that 78 duplication events occurred at the terminal-end of the ITS, and that the current subtelomeric ITS were actually the ancient 79 chromosome end telomere sequence. We also observed multiple homologies to different regions with variable degree of overlap 80  Figure 2. Phylogenetic trees for two sequence familes identified in human extension sequences a) Family A ( 9kb) b) Family B ( 8kb). Phylogenetic trees are generated from their homologous sequences. H for Homo sapiens, P for Pan troglodytes, Gor for Gorilla and Pon for Pongo abelii. p for p arm, q for q arm, Un for unplaced contig, un for unlocalised within a chromosome and i for interstitial genome regions. '+' for forward aligned and '-' for reverse aligned. '.A','.B','.C' and '.D' are appended to distinguish different copies in the same region. The brackets indicate clustering into the 3 groups: interstitial; human subtelomeric and chimpanzee subtelomeric. Bootstrap value are showed on each branch.
with the ITS (for example between 10 bps to 168 bps at position 105kb on chr18p, see Figure 3c). This suggests that duplication 81 of chromosome terminal sequence has taken place independently multiple times, and is likely mediated by telomeric homology. 82 We conclude that multiple directly duplicated events at ancient chromosome ends have generated multiple present day ITS 83 sequences at subtelomeric regions.

84
The mean size of subtelomeric ITS (336 bps) is much shorter than capping telomeric sequence. When subtelomeric ITS 85 are used as chromosome end capping telomeres, they create a dysfunctional chromosome end 11 . There are multiple ways to 86 repair the dysfunctional chromosome end 10,11,20 20 , which 89 can be seen from a characteristic inverted interstitial telomere sequence. Telomere addition to telomere is indistinguishable 90 from common telomere shortening and lengthening unless non-telomeric sequence is also involved in the addition. A common 91 observation for ITS or capping telomere is that is has TAR1 (telomere associated repeat 1) element inside, and furthermore the 92 proximal telomere identity is lower than the distal telomere identity (Figure 4d). This suggests that the ancient telomere broke 93 and a new telomere with TAR1 was added. The duplications of other subtelomeric regions to the shortening ITS are the relics 94 of duplication or translocation of other ends to dysfunctional chromosome end. These genome observations are identical to all 95 observations from in-vitro telomere repair models 10,11,20 , suggesting that joining sequence to chromosome ends could occur 96 spontaneously as a result of repairing the dysfunctional chromosome ends both in-vitro, as well as in vivo in our ancestors.

97
Population genetics at chromosome ends 98 We used the 1000 genome 21 data to estimate average genetic diversity at chromosome ends (See methods) [ Figure 5, 5d]. 99 From these data, we found the diversity decreases sharply in the first 500kb of the p-arm and the last 500kb of the q-arm. 100 This is driven by a decline in diversity in subtelomeric duplications in these regions[ Figure 5c,5f], but not the remaining 101 non-duplicated chromosome end regions [ Figure 5b,5e]. Diversity is typically correlated with divergence. However, the 102 chimpanzee genome project reported that the divergence between human and chimpanzee constantly increases toward the 103 chromosome end, suggesting the chromosome end is the most divergent sequence between human and chimpanzee 2 . Thus 104 chromosome ends are regions with the low diversity and high divergence which is consistent with a model of chromosome 105  IV) The homologous sequence is overlapping the proximal breakpoint. V) ITS is next to (<50bp) ITS proximal breakpoint VI) No homologous sequence is observed. * means updating the orientation of 2q and 12p ITS as GRCh38.   The comparison of other species chromosome end to human can not only verify the extension hypothesis but also estimate 120 the extension rate. We downloaded pairwise alignments for GRCh37 to 15 well-assembled species from UCSC 19 . 39 well-121 assembled autosome ends as well as two ancient chromosome 2 fusion ends 20 were analyzed (see Methods). By aligning 122 other species to human chromosome ends, we could identify the ancient chromosome end for the most recent common 123 ancestor(MRCA) of human and this species as the most terminal end of the homology and therefore the missing homologous 124 sequence can be defined as human extension sequence since the MCRA (see Methods, Supplementary  (Figure 1c,1d). Extension sequences are found not only in human but also in other 129 species when these homologous sequences are also their current chromosome ends, for example, cat D4q, dog 9p and horse 130 25q (Figure 1c). The extension sequences for human and other species, together with the identical ancient chromosome ends 131 confirm the ongoing extension of chromosome ends.

132
The length of human-specific extension sequences represents a near linear relationship with MRCA time 2,18,[23][24][25] (Table 133  1), and are consistent with the accepted phylogenetic tree. One exception is that we identified 783kbps of human-specific 134 7/12 chromosome extension sequence versus gorilla, whereas we identified 1744kbps versus chimpanzee, which runs counter to 135 the common belief that human is more closely related to chimpanzee than gorilla. However, this may be resolved by the the 136 observation that 30% of the gorilla genome sequence is closer to human or chimpanzee than the latter are to each other 23 . 137 We estimated the human extension rate by dividing the total length of extension sequences by the estimated time since 138 the most recent common ancestor (MRCA) (see Methods, Table 1).The extension rate is more accurately estimated by close 139 species such as primates. Assuming the GRCh37 represents the human full chromosome sequences and the pairwise alignment 140 results represent the ancient chromosome ends, the average human autosome ends (n=41) extension rate since the MRCA with 141 other primates is between 78 and 436 kbps per million years. In this study we provide evidence that chromosome ends are dynamic and have undergone substantial extension since divergence 144 of primates. We show that the process of chromosome-end extension is ongoing and generates significant polymorphism in the 145 population, which is heritable from one generation to the next. 146 We identified two duplication families which are participating in ongoing chromosome end expansion. These families 147 appear to both originate from the same region on ancient primate chromosome 7(GRCh38, chr7:45793082-45823926), separated 148 only by 13kb; however they have undergone distinct patterns of duplication in subtelomeric regions of at least two primates 149 (Chimpanzee and Human). These families are still participating in subtelomeric duplication in human populations, as can be 150 seen by alignment of these families to observed BioNano chromosome extension sequence. We showed that duplication of 151 these families in sub-telomeric regions has lead to the formation of new fusion genes. This may provide a mechanism for the 152 extension sequence to be adaptively selected and eventually become fixed in the population. We also showed that formation of 153 extension sequence has resulted in genetic diversity within subtelomeric duplications.

154
Our analysis indicated that many subtelomeric duplications have been mediated by subtelomeric interstitial telomeric 155 sequence (ITS), and that the duplications occur on the distal side of these ITS. This indicates that the interstitial telomeric 156 sequence are the ancient chromosome ends, and that duplication occurred via a process of fusion to the capping telomere at 157 the chromosome end. Moreover, the observed extensions in the BioNano sequence data appear to be compatible with this 158

8/12
hypothesis, although the current resolution of this approach is too large to be conclusive. 159 Normally, duplication of a subtelomeric region into another subtelomeric regions will never involve interstitial sequence, 160 thus the interstitial origin of sequences at subtelomeric regions indicate an unknown mechanism. Interestingly, two cases 161 of direct interstitial sequence joining to telomere sequence were described in a human cancer fusion study 9 . One was a 447 162 bps interstitial chromosome 4 sequence joining to 10q telomere at one side and 4p telomere at another side. The other was a 163 211 bp interstitial chromosome 2 sequence and a 374 bps chromosome 17 sequence joining to 4q telomere and Xp telomere, 164 respectively. These cases in cancer support the ability of interstitial sequence to join to the telomere. A possible model is that 165 the interstitial sequence is deleted and then directly joined to the telomere, similar to the 'deletion-plus-episome' model for 166 double minutes in cancer 26, 27 . 167 There are several potential mechanisms we might consider for annealing of chromosome sequence to the capping telomere 168 sequence 4 : non-allelic homologous recombination (NAHR), non-homologous end joining (NHEJ), microhomologous-mediated 169 end joining(MMEJ) 28 and single strand annealing(SSA) 28 . The presence of overlapping homology at the breakpoints suggests 170 this mechanism is homology mediated. Moreover, the co-location of the breakpoints with interstitial telomere sequence 171 indicates that the process may be partially mediated by micro-homology of the telomeric repeat sequence. Both MMEJ and 172 SSA are also consistent with the observation that all interstitial telomere repeat sequence are in the same orientation.

174
Bionano data analysis 175 We downloaded publicly available BioNano data for eight humans (a CEPH trio, an Ashkenazi trio and two Han Chinese) 176 from BioNano Genomics website http://bionanogenomics.com/science/public-datasets/. The initial 177 downloaded data contained raw Bionano data, assembled contigs as well as the unique alignments from contigs to GRCh37 178 generated using RefAlign from BioNano Genomics IrysSolve. This alignment mapped each Bionano contig to the best matching 179 position on GRCh37, and did not require a full-length alignment of contig to reference, thus allowing us to identify chromosome 180 end polymorphism. We assumed that contigs which are aligned to the most distal sequence with highest matching score 181 represent the individual sample chromosome ends. As the technology is currently unable to resolve different chromosome end 182 haplotypes, the chromosome ends can be viewed as a dominant marker for the longest allele.

183
The reference chromosome end sequences which could not be aligned by are defined as deletion sequences. The reference 184 unknown sequences ("N" region) are removed for deletion size estimation. The unaligned sequence at the distal part of the 185 individual chromosome end is defined as extension sequence. Deletion and extension sequences are all more than 8.9 kb(average 186 labels distance), otherwise, we defined as the reference type. Chromosomes 13p, 14p, 15p, 21p, 22p were removed because of 187 missing reference sequence. Sex chromosomes are removed because they are unequal between sample. 1p is removed when all 188 the sample's contigs are discontinuous at the reference unknown region (chr1:471k-521k) and the remaining of 1p could only 189 be aligned as secondly alignment.

190
In order to recognize the origin of the extension sequences, we realigned the contigs to GRCh37(373,590 labels) allowing 191 multiple matches by RefAlign. A pre-alignment process in RefAlign was used to merge labels which were close to each other. 192 This process resulted in identification of 346,991 labels for subsequent analysis. The mean distance and standard deviation 193 between adjacent labels was 8.9 kb and 100 kb, respectively. We then used RefAlign to re-align the contigs to the GRCh37 194 reference enzyme recognition sites (RefAlign parameter: -output-veto-filter intervals.txt$ -res 2.9 -FP 0. . This re-alignment allowed us to more accurately identify chromosome ends. In particular, for 6p 199 in sample NA12878, the multiple alignments supported a more distal alignment (chr6:73k-364k) than the unique alignment 200 (chr6:245k-1870k). The break of these two contigs was mediated by a connection from 6p to a chromosome 16 interstitial 201 region, which could be an artefact. Similarly, for 16p in NA24385, the multiple alignments supported a more distal alignment 202 (chr16:67k-204k) than the unique alignment (chr16:204k-3728k). The more distal contig could align to both 19p and 16p with 203 no overlap, but the alignment at 19p(chr19: 61k-244k) is longer than 16p making it primarily align to 19p in unique alignment. 204 Notably, the contig with highest matching score for 19p in this sample is another contig(chr19:61k-2341k). 205 We also realigned individual contigs to other contigs with the same parameter as above. When a label in one sample was 206 aligned to another sample label, we regarded these two labels were connected. For duplication family A and B, we extracted the 207 label IDs from GRCh37 label map. Family A contained 3 to 4 labels. Family B contained 2 labels. For the alignment regions 208 containing family A or B, we only drew these regions when at least 3 or 2 labels were aligned to GRCh37, respectively.

209
For the CEPH and Ashkenazi trios, we checked the alignments from the child to their parent for every chromosome 210 end(Supplementary Figure 1). If the child chromosome end was aligned to one of its parent with no more than one label 211 difference at the terminal, we considered that the child chromosome end was inherited from the parent. Subtelomeric duplication analysis 213 We downloaded all pairs of regions in the human genome (GRCh37) with high sequence identity from the segmental duplication 214 database 29 . We assign the copy number of a base-pair position to be the number of entries in this file which overlap this position. 215 We take all positions with copy number at least 22 to be high copy number duplication families. We order these families by 216 their base-pair length, defined as the longest contiguous stretch of DNA with copy number greater than 22. Only duplication 217 families which have at least one member within 1MB of a chromosome end, were kept (Supplementary Table 7). This resulted 218 in an identification of two duplication families with high copy number, which was selected for evolutionary analysis. 219 We used NCBI blast to align these two sequences (GRCh37,A:chr1:652579-662291,B:chr16:90220392-90228245) to the 220 human genome, chimpanzee genome and nucleotide collection (NT) database, filtering out sequences with less than 95%(human 221 and chimpanzee) or 85%(other) sequence identity to the full query sequence. Next, we extracted the duplications, aligned them 222 by mafft 30 [key option: strategy G-INS-i] and built the neighbor-joining tree by T-rex 31 [option:default].

223
Subtelomeric duplication regions are defined as the longest contiguous duplication sequence starting from the chromosome 224 terminal. From the duplication database 29 , both pairs of duplications regions inside subtelomeric duplication regions are 225 defined as subtelomeric duplications. The longest autosome subtelomeric duplication pairs are chr1:317720-471368 and 226 chr5:180744335-180899425. Sex chromosomes are excluded from this analysis because homology between the X and Y 227 chromosomes in p arm terminal is maintained by an obligatory recombination in male meiosis 32 .

228
Subtelomeric Interstitial telomere sequence analysis 229 Telomere sequences were annotated from GRCh37 repeatmasker database 33 . We extracted the non-capping telomere sequences 230 inside subtelomeric regions as subtelomeric ITS sequence. We also included the ancient subtelomeric region in the chromosome 231 2 fusion sites 20 . Then we searched for all the subtelomeric ITS and their adjacent duplications in duplication database 29 . Bases 232 on all the possible overlapping, we divided into 6 types. Type a is duplication spanning the whole ITS. Type b and type d are 233 duplications overlap with the ITS from the distal and proximal site, respectively. Type c and type e are duplications adjacent 234 to (<50 bps) ITS from the distal and proximal sites, respectively. Type f means there is no duplication adjacent to ITS. We 235 counted the number of duplications for each type for each ITS.

236
Fusion gene analysis 237 Gene information was obtained by zooming into specific genome regions from UCSC browser 19 . Genes which contain coding 238 sequence which lies within the duplication as well as coding sequence which lies outside the duplication it is considered as a 239 fusion gene. We extracted gene expression information from the Expression Atlas 22 recording the highest expression level in 240 Supplementary Table 3   . The base pair diversity is defined as h = 1 − ∑ n i=1 f i 2 , where f is allele frequency 246 and n is the number of observed alleles. We performed a local regression on diversity using the R function(loess.smooth 247 parameter: degree=10,span=0.3). Finally, we summarize the average diversity from all the chromosome for each bin as 248 ,where is total number of chromosomes. For analysis of subtelomeric duplications only, the 249 window size could not be pre-specified because the size of the regions are variable. As a result, for this analysis we specified 250 the total number of bins to be 150 for each chromosome subtelomere region. In order to calculate divergence between Human 251 and Chimpanzee we first aligned the human and Chimpanzee genomes using bwa 34 (default parameters). Divergence was 252 estimated as the percentage of unaligned base pairs within each 100bp window. The entirely unaligned regions are excluded for 253 this estimation.

254
Human extension rate analysis 255 We downloaded pairwise alignment files from UCSC3. These files contain region alignments from species to GRCh37. Initially, 256 21 genomes (Chicken, Chimp, Cow, Dog, Gibbon, Gorilla, Horse, Marmoset, Mouse, Orangutan, Rat, Rhesus, Sheep, Baboon, 257 Cat, Elephant, Kangaroo, Panda, Pig, Rabbit and Zebrafish) were analyzed. For each human autosome as well as ancient 258 chromosome 2A and 2B 20 , we sorted the alignments by human chromosome and location. We searched for the most terminal 259 end alignment. Because short alignments could result from common repeat elements and subtelomeric duplications. We only 260 selected the alignments longer than human longest subtelomeric duplication (154k) to represent ancient chromosome sequence. 261 Because sequence divergence and genome assembly quality will significantly affect the alignment length. Baboon, Kangaroo, 262 10/12