Identification and characterization of abundant repetitive sequences in Allium cepa

Species of the genus Allium are well known for their large genomes. Allium cepa is of great economic significance: among vegetables, it ranks second after tomato in global production value. However, limited genomic information is available for A. cepa. In this study, we sequenced the A. cepa genome at low coverage and annotated its repetitive sequences using a combination of next-generation sequencing (NGS) and bioinformatics tools. Nearly 92% of the 16 Gb haploid onion genome was identified as repetitive sequences. Of these, 162 clusters, each comprising at least 0.01% of the genome and together representing 40.5% of the genome, were analyzed in detail to obtain an overview of the representative repetitive elements in the A. cepa genome. A few representative satellite repeats were studied by fluorescence in situ hybridization (FISH) and Southern blotting. These results provide a basis for evolutionary cytogenomics within the genus Allium.

Owing to advances in next-generation sequencing (NGS), de novo assembly, and annotation methods, repetitive sequences can be studied effectively and at reasonable cost by combining low-pass NGS 12 with graph-based clustering analysis 13,14. The difference in genome size between Arabidopsis (1C = 150 Mb) and A. cepa (1C = 16 Gb) is mainly due to the amplification of repetitive DNA sequences. Repetitive sequences include tandem repeats (satellites, minisatellites, and microsatellites) and transposable elements (TEs) 15. As in all other eukaryotes, TEs in plants are categorized as class I (retrotransposons) or class II (DNA transposons). Class I comprises all TEs that transpose via an RNA intermediate in a "copy-and-paste" process. Class II DNA transposons transpose through a DNA intermediate via a "cut-and-paste" mechanism and usually maintain a moderate copy number in the genome 16. In this study, we sequenced the A. cepa genome at low coverage, identified and characterized its most abundant repetitive sequences, and determined the chromosomal localization of a few repeats.

Materials.
A. cepa cv. Fuxing was obtained from the Institute of Economic Crops, Hubei Academy of Agriculture Sciences, China.
DNA extraction for NGS. We collected leaves and roots from greenhouse-grown seedlings and extracted DNA using a DNeasy Plant Mini Kit (Qiagen).

NGS.
A sequencing library was prepared using the NEBNext® Ultra™ DNA Library Prep Kit for Illumina (New England Biolabs, Ipswich, MA, USA). Paired-end sequencing (2 × 150 bp, 350-400 bp insert size) of total genomic DNA was performed by iGeneTech Co. Ltd. (Beijing, China) on a single lane of the Illumina HiSeq2500 platform. Clean sequencing data were supplied in FASTQ format with adapters removed.
RepeatExplorer. The RepeatExplorer pipeline 14 (https://repeatexplorer-elixir.cerit-sc.cz/galaxy/) was used with default settings to cluster the NGS reads into groups of similar reads. Repeat clusters with genome proportions of at least 0.01% were selected for further analysis. Clusters containing known protein domains were classified directly by the RepeatExplorer pipeline. The remaining clusters were analyzed manually by similarity searches against the GenBank databases (Nt and Nr) using Blastn and Blastx 17 with an E-value cutoff of 1e−5.
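The cluster selection step can be illustrated with a minimal sketch. It assumes that the genome proportion of a cluster is the number of reads in the cluster divided by the total number of reads analyzed; the function name, the input dictionary, and the toy read counts are hypothetical and do not reproduce the RepeatExplorer output format.

```python
from typing import Dict


def select_clusters(read_counts: Dict[str, int], total_reads: int,
                    min_percent: float = 0.01) -> Dict[str, float]:
    """Return clusters whose genome proportion (%) is at least min_percent.

    Genome proportion is taken here as reads in the cluster divided by all
    reads analyzed (an assumption of this sketch).
    """
    proportions = {
        name: 100.0 * count / total_reads
        for name, count in read_counts.items()
    }
    return {name: pct for name, pct in proportions.items() if pct >= min_percent}


# Hypothetical toy input: three clusters out of 100,000 analyzed reads.
toy_counts = {"CL1": 5000, "CL2": 120, "CL3": 9}
selected = select_clusters(toy_counts, total_reads=100_000)
print(selected)  # CL3 (0.009%) falls below the 0.01% threshold
```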
PCR amplification, cloning, and sequencing of AceSat01-377 and AceSat02-750. The identified repeats were amplified using specific primers (Table 1).
Chromosome preparation. Chromosomes were prepared as described by He et al. 18. Root tips of A. cepa cv. Fuxing were collected when they reached 0.5-1 cm in length. Mitosis was blocked in α-bromonaphthalene in the dark at room temperature for 4 h; the root tips were then fixed in 4% (w/v) paraformaldehyde and stored at 4 °C for 40 min. The root tips were subsequently digested with 2% cellulase and 2% pectolyase for 60 min at 37 °C. The digested root tips were homogenized in 60% acetic acid solution and dropped onto glass slides. The prepared slides were dehydrated in an ethanol series.
Fluorescence in situ hybridization (FISH). The 45S rDNA clone pTa71 was used in the present investigation.
The repeat DNA sequences used for FISH were labelled by nick translation with Texas Red-12-dUTP (red; Invitrogen C3176) and Fluor 488-5-dUTP (green; Invitrogen C11397). FISH was performed as described 18. The slides were treated in 70% formamide in 2 × saline sodium citrate (SSC) for 5 min at 90 °C. Simultaneously, 2 mg/ml probes and 1 mg/ml sheared salmon sperm DNA were pre-mixed in 2 × SSC with 10% dextran sulphate and 50% deionized formamide, denatured at 80 °C for 5 min, and immediately cooled in ice water. After dehydration and air drying, the slides were incubated in the denatured hybridization solution overnight at 37 °C. Nuclei and chromosomes were stained with 4′,6-diamidino-2-phenylindole (DAPI, 0.2 mg/ml, Sigma, Deisenhofen, Germany) and observed under an Olympus BX-60 fluorescence microscope. Images obtained with a Sensys 1401E monochrome CCD camera were pseudo-colored and processed with the Metamorph imaging system (version 4.6.3; Universal Imaging Corp., PA, USA) and Adobe Photoshop 9.0 software.

Results and Discussion
Genomic repeatome composition. Because of the large size of the onion genome (about 16 Gbp/1C), it is difficult to analyze its overall repeat composition by traditional molecular methods 20. We therefore used NGS and the RepeatExplorer computational pipeline to reveal the genome structure. A total of 15,990,607 clean paired-end reads of 150 bp each were obtained. Illumina sequencing can introduce bias at the beginning and end of reads 21; therefore, we trimmed 10 bp from both ends of each read. Reads without Ns and with a quality score ≥ 10 over 95% of bases were retained, and only complete read pairs (interlaced reads) were used. The RepeatExplorer guidelines recommend using reads corresponding to 0.01-0.5× genome coverage (https://repeatexplorer-elixir.cerit-sc.cz). To balance high sensitivity against moderate running times with the available computational resources, we used 2,642,364 reads, representing ca. 2.16% of the genome, for clustering in the RepeatExplorer pipeline. The pipeline grouped 2,419,056 reads into 537,598 clusters, and nearly 92% of the genome was found to consist of repetitive sequences (Fig. 1). The top 162 clusters, each representing at least 0.01% of the genome, together comprised ~40.5% of the genome. The total repeat content is similar to that reported previously 5. Until now, only a few giant genomes have been analyzed for repetitive DNA composition; most of them are composed of highly heterogeneous groups with a relatively low abundance of repeat-derived DNA. For example, in the Australian lungfish (Neoceratodus forsteri) genome (~50 Gbp/1C), only 40.2% can be assigned to recognizable repetitive DNA 22; in the black salamander (Aneides flavipunctatus), with a genome of ~44 Gb, less than 50% can be assigned to known TEs 23. The genome size of diploid Fritillaria species varies between 30.15 and 85.38 Gb; about 42% of the genome was assigned to known TEs in F. imperialis 24.
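The read preprocessing and the coverage estimate described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the function names are hypothetical, quality values are passed as plain Phred integers to avoid assuming a FASTQ encoding, and "quality score ≥ 10 over 95% of bases" is read as "at least 95% of bases reach a score of 10". With 130 bp trimmed reads and a 16 Gb genome, the estimate comes out slightly above 2%, close to the ca. 2.16% reported; the exact read length and genome size behind the reported figure are not stated.

```python
from typing import List, Tuple

TRIM_BP = 10          # bases removed from each end, as stated in the text
MIN_QUALITY = 10      # minimum per-base Phred score, as stated in the text
MIN_FRACTION = 0.95   # share of bases that must reach MIN_QUALITY (our reading)


def trim_read(seq: str, quals: List[int], n: int = TRIM_BP) -> Tuple[str, List[int]]:
    """Remove n bases (and their quality scores) from both ends of a read."""
    return seq[n:len(seq) - n], quals[n:len(quals) - n]


def passes_filter(seq: str, quals: List[int]) -> bool:
    """Keep reads that contain no N and have enough bases at or above MIN_QUALITY."""
    if not seq or not quals or "N" in seq.upper():
        return False
    good = sum(1 for q in quals if q >= MIN_QUALITY)
    return good / len(quals) >= MIN_FRACTION


def coverage_percent(n_reads: int, read_len: int, genome_bp: float) -> float:
    """Sequenced bases as a percentage of the haploid genome size."""
    return 100.0 * n_reads * read_len / genome_bp


# A 150 bp read becomes 130 bp after trimming 10 bp from each end.
seq, quals = trim_read("A" * 150, [30] * 150)
trimmed_len = len(seq)

# Coverage for the 2,642,364 reads used for clustering, assuming 1C = 16 Gb.
cov = coverage_percent(2_642_364, trimmed_len, 16e9)
print(trimmed_len, round(cov, 2))
```

Expressed as a fraction, this corresponds to roughly 0.02× coverage, which lies within the 0.01-0.5× range mentioned in the text.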
Our results suggest that, unlike other giant genomes, the genome of A. cepa seems to be more similar to smaller genomes, because very large genomes are usually derived from massive amplification of a small number of LTR retrotransposons 25. The 162 clusters that each represent at least 0.01% of the genome together account for 40.5% of the genome and were further annotated (Fig. 1 and Table 2). No coding genes were found in these clusters except rDNA, mobile elements, and plastid genes (ca. 1.05%). The proportions of the repeat types in the A. cepa genome are shown in Table 2. The most abundant repeats are LTR retrotransposons, comprising 14.227% Gypsy and 3.569% Copia elements. The genome also contains 8.599% low-complexity repeats and 8.393% unknown repeats; the latter may reflect the lack of sufficiently annotated sequences from closely related species in public databases. In addition, there were 1.912% simple sequence repeats (4 clusters), 1.421% satellite sequences (3 clusters), 0.581% rDNA (1 cluster), 0.528% DNA.CMC.EnSpm (3 clusters), and 0.22% LINE.L1 repeats (1 cluster). The small repeat clusters below the 0.01% threshold make up ca. 51.049% of the genome (Table 2), or ca. 8 Gb of DNA sequence in A. cepa, and remain undetermined. Whether they are degenerate copies of the major repeats or unrelated to them remains to be answered. Hertweck and Bainard used a de novo repeat assembly method (MSR-CA) rather than graph-based clustering (RepeatExplorer) and assembled Allium fistulosum repeats from low-coverage single-end reads; the annotated repeats accounted for about 9% of the genome, which is likely an underestimate 26. Peška et al. examined Allium species (A. cepa, A. sativum, and A. ursinum) and reported that 60% of the A. cepa genome is represented by repetitive sequences 11, which is much lower than the present finding (ca. 92%). We selected 600,000 reads from the A. cepa NGS data released by Peška et al. (2019) and co-clustered them with 600,000 reads from the current study.
We confirmed that the difference arose from the two datasets themselves.
NGS and graph-based clustering analyses provide high-throughput tools for detecting satellite DNA (satDNA) 28,29. As suggested by Ruiz-Ruano et al. 29, a satDNA name should begin with the species abbreviation used in Repbase (e.g., Ace for Allium cepa), followed by the term "Sat", a catalog number assigned in order of decreasing abundance (according to the first genome analyzed), and the consensus monomer length. We therefore named two satellites AceSat01-377 (cluster 7, AcCL7) and AceSat02-750 (cluster 43, AcCL43). Such satellite DNA usually produces a globule-like (AceSat01-377) or ring-like (AceSat02-750) graph 30,31 (Fig. 2A). The AceSat01-377 and AceSat02-750 clusters comprised 1.24% and 0.17% of the genome, respectively. SatDNA monomers are 150-400 bp long in the majority of plants and animals 15. The monomer lengths of AceSat01-377 and AceSat02-750 are 377 bp and 750 bp, respectively (Table 1). Both repeats were cloned and sequenced (GenBank accession numbers MH017542 for AceSat01-377 and MH017541 for AceSat02-750). AceSat01-377 showed high similarity to a satellite commonly found in Allium species 32 and low diversity among monomers. For AceSat02-750, BLASTn analysis revealed no significant match in the National Center for Biotechnology Information (NCBI) databases (E-value = 1e−5), suggesting that it might be a novel satellite repeat.
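The naming convention described above can be written as a small helper, shown here as a sketch. The function names are hypothetical, and the two-digit zero padding of the catalog number is inferred from the names AceSat01-377 and AceSat02-750 rather than taken from a formal specification. The second function gives a rough back-of-the-envelope estimate of monomer copy number from the genome proportion, assuming a 16 Gb haploid genome and that the entire cluster consists of full-length monomers; neither estimate is reported in the study.

```python
def satellite_name(species_abbrev: str, catalog_number: int, monomer_bp: int) -> str:
    """Build a satDNA name: species abbreviation + 'Sat' + catalog number + '-' + monomer length."""
    # Two-digit padding is an assumption inferred from the names used in the text.
    return f"{species_abbrev}Sat{catalog_number:02d}-{monomer_bp}"


def estimated_monomer_copies(genome_percent: float, monomer_bp: int,
                             genome_bp: float = 16e9) -> int:
    """Rough copy-number estimate: bases attributed to the repeat divided by monomer length."""
    return round(genome_bp * genome_percent / 100.0 / monomer_bp)


name_1 = satellite_name("Ace", 1, 377)
name_2 = satellite_name("Ace", 2, 750)
copies_1 = estimated_monomer_copies(1.24, 377)   # AceSat01-377, 1.24% of the genome
copies_2 = estimated_monomer_copies(0.17, 750)   # AceSat02-750, 0.17% of the genome
print(name_1, copies_1)
print(name_2, copies_2)
```

Under these assumptions the estimates are on the order of 5 × 10^5 copies for AceSat01-377 and a few tens of thousands for AceSat02-750; they are illustrative only.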

Repeat type                              Genome proportion (%)
Total in analyzed clusters               40.5
Small clusters that were not analyzed    51.049
Non-clustered reads                      8.451
Total                                    100
Table 2. Repeat composition of Allium cepa genome estimated from the Illumina sequencing data.

PCR with AceSat01-377 and AceSat02-750 primers on genomic DNA of A. cepa resulted in ladder-like PCR products (Fig. 2B, S2 and S3), confirming the tandem organization of these repeats in A. cepa. The AceSat02-750 clone was used as a probe for Southern hybridization of genomic DNA. After EcoRI and XbaI digestion, AceSat02-750 revealed a ladder-like pattern (Fig. 3 and S3B), confirming its organization as tandemly repeated sequences.
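The totals in Table 2 can be cross-checked against the read counts and repeat-type proportions reported in the text. The sketch below is only a consistency check: the dictionary keys are labels chosen here, and treating the "ca. 1.05%" figure as the remaining share of the analyzed clusters is an assumption.

```python
# Repeat-type proportions (% of genome) as quoted in the text.
repeat_types = {
    "Gypsy": 14.227,
    "Copia": 3.569,
    "Low complexity": 8.599,
    "Unknown": 8.393,
    "Simple sequence repeats": 1.912,
    "Satellite": 1.421,
    "rDNA": 0.581,
    "DNA.CMC.EnSpm": 0.528,
    "LINE.L1": 0.22,
}
other_percent = 1.05  # "ca. 1.05%" quoted in the text (assumed to be the remainder)

analyzed_percent = sum(repeat_types.values()) + other_percent

# Read counts reported for the RepeatExplorer run.
reads_used = 2_642_364
reads_clustered = 2_419_056

clustered_percent = 100.0 * reads_clustered / reads_used
non_clustered_percent = 100.0 - clustered_percent
small_clusters_percent = clustered_percent - 40.5  # clustered but not analyzed

print(round(analyzed_percent, 3))        # share covered by the 162 analyzed clusters
print(round(clustered_percent, 3))       # total repeat share ("nearly 92%")
print(round(small_clusters_percent, 3))  # small clusters that were not analyzed
print(round(non_clustered_percent, 3))   # non-clustered reads
```

The computed values agree with the three rows of Table 2, and the repeat-type proportions sum to the 40.5% reported for the analyzed clusters.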
Chromosome localization of AceSat01-377 and AceSat02-750. SatDNA sequences are located in heterochromatic regions, which occur not only at the centromeric and subtelomeric regions of chromosomes but also at intercalary positions. FISH with AceSat01-377, AceSat02-750, and 45S rDNA was carried out on metaphase chromosomes of A. cepa to investigate their chromosomal distribution (Figs 4 and 5). 45S rDNA localized on three chromosomes (Fig. 4A and S5). In most cases, two pairs of 45S rDNA loci have been reported in A. cepa 8. However, two, three, or four 45S rDNA loci can be expected in A. cepa, possibly owing to the mobility of the NOR 33. The AceSat01-377 clone hybridized to the sub-terminal regions at both ends of only one pair of chromosomes and labeled only one end of the remaining chromosomes, except for two chromosomes (Figs 4B and 5). The heterozygous AceSat01-377 signals are strong on each labeled chromosome, so it is unlikely that a technical issue left one chromosome unlabeled while its homolog was labeled. In addition, heterozygosity of rDNA and repeats has been reported in Allium cepa 33. AcepSAT356 and AcepSAT750 11 are similar to the present AceSat01-377 and AceSat02-750, respectively, but the differences in FISH patterns between the two studies suggest that these two repeats differ between A. cepa cultivars. AceSat02-750 occurred distal to AceSat01-377 at one end of three chromosomes (Figs 4B and 5). Taken together, the three probes allowed us to identify 6 of the 16 chromosomes of A. cepa (Fig. 5). However, we failed to obtain FISH signals with the other candidates (Table S1). Possibly, they are clustered in small groups that are insufficient to yield unambiguous FISH signals.

Data availability
All data pertaining to the present study are included in the tables and/or figures of this manuscript, and the raw sequencing reads have been uploaded to the NCBI SRA database. The RepeatExplorer output archive has been uploaded to Figshare (https://figshare.com/s/b5adf97d66269b0369bc).