High-throughput analysis of the satellitome illuminates satellite DNA evolution

Satellite DNA (satDNA) is a major component of eukaryote genomes, yet it remains their great unknown and is clearly underrepresented in genome sequencing projects. Here we show the high-throughput analysis of satellite DNA content in the migratory locust by means of the bioinformatic analysis of Illumina reads with the RepeatExplorer and RepeatMasker programs. This unveiled 62 satDNA families, and we propose the term “satellitome” for the whole collection of different satDNA families in a genome. The finding that satDNAs were present in many contigs of the migratory locust draft genome indicates that they occupy many genomic locations that are undetectable by fluorescent in situ hybridization (FISH). The cytological pattern of five satellites showing common descent (belonging to the SF3 superfamily) suggests that non-clustered satDNAs can become clustered through local amplification at any of the many genomic loci resulting from previous dissemination of short satDNA arrays. The fact that all kinds of satDNA (micro-, mini- and satellites) can show the non-clustered and clustered states suggests that all these elements are mostly similar, except for repeat length. Finally, the presence of VNTRs in bacteria, showing similar properties to non-clustered satDNAs in eukaryotes, suggests that this kind of tandem repeat shows common properties in all living beings.


Supplementary Tables
SatDNA families reported in Orthoptera before this study

[...] it showed a significant tendency to be higher in the Northern genome (T = 472.5, P = 0.013). It is necessary to bear in mind that our analyses were made in a single individual per lineage, so they are intragenomic rather than population estimates. Therefore, we cannot rule out that a given satDNA absent from one of the two genomes analyzed might actually be present in other individuals from the same lineage. For instance, LmiSat62-23 was not found bioinformatically in the Southern genome, but it was observed by FISH in a different individual belonging to this same lineage (see Table 1).

Results S2 | Monomer length variation
The 58 satDNAs showed high variation for monomer length (8-
The total number of proximal, interstitial and distal loci did not differ significantly between short and long satDNAs (RxC: P = 0.170, SE = 0.006).

Figure S1: Alignments between the different variants found for several long (a-j) and short (k-r) satDNA families.

Figure S3: Minimum spanning trees for superfamilies 1, 2, 4 and 5 (a-d). In a-c, link size between haplotypes is proportional to the number of substitutions (s) and indels (id) (in d, links are also indicated as mutational steps). The sum of nucleotides involved in the indels is indicated in brackets.

[...] The exclusive presence of these two satDNAs at a coincident distal location on the L2 chromosome suggests that SF2 arose on this chromosome and has not moved to other non-homologous chromosomes. This case illustrates how the differential accumulation between variants gives rise to new satDNA families when similarity decreases below the 80% criterion.

c) Superfamily 4 (SF4) includes two variants of LmiSat26-240 (240 nt) and a single variant each of LmiSat37-238 and LmiSat51-241. All three satDNA families were interstitially located but on different chromosomes: S11, L1 and L2, respectively, with LmiSat37-238 showing a second cluster proximally located on S11. SF4 thus reflects how satDNAs move between non-homologous chromosomes.

d) Superfamily 5 (SF5) includes three short satDNAs (LmiSat31-8, LmiSat50-16 and LmiSat59-16) showing different location patterns: LmiSat31-8 is pericentromeric on S9 and S10, LmiSat50-16 is interstitial on S9, and LmiSat59-16 is non-clustered. Sequence alignment suggests that the LmiSat50-16 and LmiSat59-16 families could have arisen from LmiSat31-8 through duplication (Supplementary Fig. S4). However, a minimum spanning tree for these three families suggests that LmiSat50A-16 (which is abundant in both lineages) is the ancestral variant, and that LmiSat31-8 emerged in the Southern genome and LmiSat59-16 in the Northern one. In addition, the fact that simulated genomes of L. migratoria would contain, by chance, more than 200,000 copies of DNA motifs identical to the three LmiSat31-8 variants (Supplementary Table S5), together with its exclusive presence in the Southern genome, suggests the possibility that this extremely short satDNA arose independently from the two other SF5 members in the Southern lineage. Likewise, LmiSat59-16 could have derived from LmiSat50-16 in the Northern lineage, but the fact that simulated genomes included 6 and 4 copies, respectively, of these two satDNAs (Supplementary Table S5), and their different patterns of chromosomal location (clustered and non-clustered, respectively), cast some doubt on this possibility. Therefore, the reliability of SF5 needs additional analysis.
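The chance-occurrence argument above can be illustrated with a back-of-the-envelope calculation. The sketch below is not the simulation behind Supplementary Table S5; it assumes a random genome with uniform, independent base frequencies, counts matches on both strands, and uses the 6.3 Gb genome size quoted for Table S2. The function name and the `strands` parameter are hypothetical.

```python
def expected_chance_copies(motif_length: int, genome_size: float, strands: int = 2) -> float:
    """Expected exact matches to ONE motif in a random genome.

    Assumes each base is drawn independently with probability 1/4,
    which real (and realistically simulated) genomes do not satisfy.
    """
    # Number of start positions where a motif of this length can fit.
    positions = genome_size - motif_length + 1
    # Probability of an exact match at one position is (1/4)^motif_length.
    return strands * positions / 4 ** motif_length


GENOME_SIZE = 6.3e9  # bp, the genome size used in the Table S2 caption

for k in (8, 16):
    print(f"{k}-nt motif: ~{expected_chance_copies(k, GENOME_SIZE):.3g} chance copies")
```

Under these assumptions an 8-nt motif is expected on the order of 10^5 times per genome, whereas a 16-nt motif is expected only a few times. This matches the qualitative contrast drawn above between LmiSat31-8 and the two 16-nt families, although the exact counts depend on base composition and on how the simulated genomes were built.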
Figure S4: Alignment of LmiSat31-8 dimers and LmiSat50-16 and LmiSat59-16 dimers, all belonging to superfamily 5, showing how the latter two families could have derived from a dimer of the former satDNA.

Figure S7: Primer design and PCR amplification for long (a) and short (b) satDNAs. Note that long satDNAs (e.g. LmiSat01-193, shown here) show ring-shaped RepeatExplorer cluster graphs because read length is shorter than monomer length. We designed divergent primer pairs, with nearby 5' ends, and tested them at annealing temperatures of 55, 60 and 65 °C. Dimer amplification was observed at the highest temperature (a). Short satDNAs (e.g. LmiSat04-18, shown here) show spherical RepeatExplorer cluster graphs because monomer length is shorter than read length. We designed divergent primers, choosing those that formed the least stable dimers. We obtained a delimited smear whose size increased with annealing temperature (b).

Table S2: Length (bp), A+T content (%), abundance (% of the genome), number of repeats calculated as "[abundance x genome size (6.3 Gb)]/repeat length", and divergence (%) for all satDNA variants found in the gDNA libraries analyzed from Southern (SL) and Northern (NL) lineages.
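The repeat-number formula in the Table S2 caption can be sketched as follows. This is a minimal illustration, assuming that abundance is given as a percentage of the genome (as the caption states) and therefore has to be divided by 100 before use. The function name is hypothetical, and the abundance value in the example is made up for illustration; it is not taken from Table S2.

```python
GENOME_SIZE_BP = 6.3e9  # genome size stated in the Table S2 caption (6.3 Gb)


def repeat_number(abundance_percent: float, repeat_length_bp: int,
                  genome_size_bp: float = GENOME_SIZE_BP) -> float:
    """Number of repeats = (abundance x genome size) / repeat length."""
    if repeat_length_bp <= 0:
        raise ValueError("repeat length must be positive")
    # Convert abundance from percent of the genome to a fraction.
    abundance_fraction = abundance_percent / 100.0
    # Base pairs occupied by the satDNA, divided by the monomer length.
    return abundance_fraction * genome_size_bp / repeat_length_bp


# Hypothetical example: a 193-bp satDNA making up 0.1% of the genome.
print(round(repeat_number(0.1, 193)))
```

Because the estimate scales linearly with abundance and inversely with monomer length, a short satDNA can have far more repeat units than a long one at the same genomic abundance.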