The first draft genome of feather grasses using SMRT sequencing and its implications in molecular studies of Stipa

The Eurasian plant Stipa capillata is the most widespread species within feather grasses. Many taxa of the genus are dominants in steppe plant communities and can be used for their classification and in studies related to climate change. Moreover, some species are of economic importance mainly as fodder plants and can be used for soil remediation processes. Although large-scale molecular data has begun to appear, there is still no complete or draft genome for any Stipa species. Thus, here we present a single-molecule long-read sequencing dataset generated using the Pacific Biosciences Sequel System. A draft genome of about 1004 Mb was obtained with a contig N50 length of 351 kb. Importantly, here we report 81,224 annotated protein-coding genes, present 77,614 perfect and 58 unique imperfect SSRs, reveal the putative allopolyploid nature of S. capillata, investigate the evolutionary history of the genus, demonstrate structural heteroplasmy of the chloroplast genome and announce for the first time the mitochondrial genome in Stipa. The assembled nuclear, mitochondrial and chloroplast genomes provide a significant source of genetic data for further works on phylogeny, hybridisation and population studies within Stipa and the grass family Poaceae.


Scaffolding of contigs.
Nearly all contigs of S. capillata genome can be assigned to the reference chromosomes of Brachypodium distachyon L., Hordeum vulgare L. and Aegilops tauschii Coss., whereas genomes of Oryza sativa L. and especially Triticum aestivum L., have much less homology to the feathergrass assembly. In particular, 95.16% contigs of S. capillata genome were assigned to seven chromosomes of A. tauschii genome, 94.68% to five chromosomes of B. distachyon, 94.20% to seven chromosomes of H. vulgare, 89.92% to 12 chromosomes of O. sativa and only 41.67% to 21 chromosomes of T. aestivum. The total length of non-assigned contigs was reasonably low for A. tauschii (48. (Table 3). In addition, the RaGOO grouping confidence and orientation confidence scores per chromosome ranged from 57.81 to 76.11% and from 80.03 to 95.11%, respectively, indicating that the contigs could be placed on a chromosome with an acceptable level of confidence (Supplementary Table S2). The only exception is T. aestivum for which scores ranged from 30.49 to 47.76% for the grouping confidence score and from 57.81 to 70.19% for the orientation confidence score. Nevertheless, based on the location confidence score, the exact position of the contigs on a chromosome could not be accurately estimated, reflecting a low level of synteny to the reference genomes. In Table 1. Statistics of the nuclear genome assemblies.

Metrics
Flye assembly FALCON assembly Flye assembly after Purge Haplotigs    Table 4). The subsequent structural annotation of the masked genome revealed 53,535 nuclear genes (Supplementary File 1). On the other hand, the unmasked genome has 154,755 structurally annotated genes and 94,237 of them have BLAST hits in the NCBI non-redundant database. Nonetheless, among the 94,237 genes of the unmasked genome, 12,094 sequences are related to transposable elements. In particular, 2,925 genes associated with transposons, and 9,859 assigned to retrotransposons. In addition, 229 genes encode transposase-related proteins. Thus, except transposable elements the unmasked genome has 81,224 genes that can be associated with already known proteins (Supplementary File 2). SSR markers. In total, 77,614 perfect repeat motifs were identified for the nuclear genome assembly using Krait 75 (Supplementary File 3). Within those, di-and tri-nucleotides were the most common types, accounting 28,365 (36.55%) and 25,794 (33.23%) repeats, respectively. Tetra-nucleotide motifs were the third most abundant repeats with 9,777 SSRs (12.60%), followed by mono-nucleotides with 6,572 SSRs (8.47%) and penta-nucleotides with 4,629 SSRs (5.96%). Hexa-nucleotides were the rarest motifs with 2,477 SSRs (3.19%). Only four mononucleotide, four di-nucleotide and three tetra-nucleotide motifs were found in the mitochondrial and chloroplast genomes. However, a total length of those SSRs was in a range of 12-16 bp. In addition, in total 58 unique repeats present only in a single copy in a range 101-325 bp were retrieved from the analysis of TEs. Within those were four hexa-, 35 hepta-, nine octa-, five nona-and five deca-nucleotide motifs (Supplementary Table S3).
Divergence time of Stipa. The Bayesian phylogenetic reconstruction based on the five loci within NORs revealed the divergence time of Stipa from Brachypodium around 30.00-35.52 Mya and the putative origin of feather grasses about 2.90-6.02 Mya (Fig. 2). Although not all branches were well supported within the genus, the current analysis confirmed the monophyly of Stipa and the general grouping of the analysed species regarding their taxonomic positions. In particular, S. capillata and S. grandis represent the section Leiostipa Dumort; S. magnifica Junge, S. narynica Nobis, S. lipskyi Roshev. and S. caucasica Schmalh. belong to the section Smirnovia Tzvelev. The remaining three groups include (1) S. orientalis Trin. and S. pennata L., (2) S. richteriana Kar. & Kir., S. lessingiana Trin. & Rupr., S. heptapotamica Golosk. and S. korshinskyi Roshev, (3) S. lagascae and S. breviflora currently have a discrepancy between morphological and molecular data. In addition, the divergence time estimation indicates that the potential origin of the clade comprising S. capillata and S. grandis is in a range of 0.67-2.93 Mya while the sister clade has the 95% credibility intervals for that parameter in a range of 2.38-4.78 Mya. Furthermore, the lowest genetic divergence time was registered for S. lessingiana and S. richteriana (0.00-0.48 Mya) as well as for the split between S. heptapotamica and the two above-mentioned species (0.01-0.78 Mya). The divergence times for the rest of taxa are present in Table 5.    (Fig. 3). However, after a manual checking in IGV v.2.8.6 80 the final size of the chloroplast genome was slightly reduced to 137,823 bp. In addition, an analysis using Cp-hap 81 detected two structural haplotypes of the chloroplast genome: haplotype A 82 (LSC-IR, reverse-complement (rc)-SSCrc-IR) and haplotype B 83 (LSC-IRrc-SSC-IR). We also obtained one assembly using Unicycler v.0.4.8 84 resulted in 76 linear contigs from which 29 can be assigned to mitochondrial sequences with a total length of 1,668,569 bp. Due to the Unicycler assembly being more complex and none of the obtained contigs were likely to be circular in nature, for the downstream genome annotation we used the Flye assembly. In total, 112 and 133 genes were functionally annotated for mitochondrial and chloroplast genomes, respectively. The mitochondrial annotation resulted in 78 protein-coding genes, 4 ribosomal RNA genes and 30 tRNA genes. The chloroplast annotation contained 85 protein-coding genes, 8 ribosomal RNA genes and 40 tRNA genes. The chloroplast genome size of 137,823 bp generated with Flye and the number of annotated genes in the current study were similar to the known assemblies for S. capillata obtained by Illumina sequencing 57 . However, the previous genome assemblies were slightly longer, specifically 137,830 bp 86  with SNPs. The BLAST process revealed 58,701 Silico markers and 52,252 sequences with SNPs that were successfully mapped to 4,361 and 3,935 genome contigs, respectively. Thus, the current genome assembly has 95.72% of Silico markers and 98.64% of sequences with SNPs that are represented in 73.52% (the total length of 969.30 Mb) and 66.34% (940.37 Mb) of the contigs, respectively. In addition, we established that 50,953 Silico markers and 47,181 sequences with SNPs were present only in a single copy in the genome. Finally, we identified 30 Silico markers and 10 sequences with SNPs aligned to the mitochondrial genome and only 2 Silico markers and 4 sequences with SNPs that were found in the chloroplast genome.

Discussion
The number of sequenced plant genomes is rapidly increasing year by year serving as a fundamental resource for various genomic studies. In the current work, we present a 1004 Mb genome with the 23 × coverage of the most widespread feather grass species, S. capillata, using SMRT PacBio sequencing. The current assembly comprises 5,931 sequences with a contig N50 length of 351 kb ( Table 1). The BUSCO completeness score of 93.10% (Table 2), the observation of a large portion of TEs (57.68%, Table 4) and the presence of Silico (95.72%) and SNPs (98.64%) markers derived from the DArT platform indicate that the assembly is of high quality. Moreover, the proportion of TEs has been reported for the first time in the genus due to the previous de novo assemblies which were performed exclusively based on transcriptomic data 50,52,54 . In addition, here we also attempted to perform a reference-guided scaffolding of the assembled contigs. Nevertheless, although nearly all contigs of the S. capillata genome were assigned to the chromosomes of B. distachyon, H. vulgare and A. tauschii, it was not possible to estimate their proper position on the reference with an acceptable level of confidence (Table 3  Table S2). In general, in the absence of a high-density genetic linkage map the task of reconstructing pseudomolecules of chromosomes seems to be challenging. On the other hand, we believe that in order to improve the contiguity of the long-read assembly the high-throughput chromosome conformation capture (Hi-C) 88 technique should be applied. Currently, many studies on non-model species successfully utilised a combination of long-read techniques and Hi-C data to perform assemblies at chromosome scale [89][90][91] . Moreover, an additional key for improving this genome assembly in the future is merely to get more sequencing reads.
Recently, it was shown that contig length metrics are positively correlated with both read length and sequence coverage. Specifically, long-read assemblies in maize demonstrated that the highest contig N50 of 24.54 Mb was reached with a subread N50 of 21,166 bp and a 75-fold depth of coverage while the longest contig of 79.68 Mb was observed with the same subread N50 but with a 60-fold depth 92 . The newly generated genome has a GC content of 45.97% that is similar to the known estimates for species in Stipa varying in a range of 46.61-49.05% 93 , and more broadly to grasses ranging from 43.57% in O. sativa to 46.90% in Z. mays 94 . Recently, it was shown that a higher GC content in monocots is associated with adaptation to extremely cold and/or dry climates 95 . The genus Stipa highly supports this hypothesis due to the fact that all feather grasses are adapted to temperate, dry climates 36 . In addition, a positive correlation between the GC content and genome size was established 96 suggesting insertion of LTR retrotransposons as a potential driving force of genome enlargement 97 . Similarly, here we showed that the expansion of the S. capillata genome also resulted from insertions of repetitive sequences that occupy 57.68% of the genome including LTR retrotransposons (13.97%). However, among all repetitive sequences around 34.34% are currently unclassified (Table 4) Importantly, the presented genome size is roughly twice smaller than the expected size of 2,355 Mb and twice bigger than the expected monoploid size of 589 Mb estimated using flow cytometry 93 . Considering that we were unable to remove redundant sequences due to possible heterozygosity and the number of duplicated BUSCOs (Tables 1 and 2), it may be presumed that the current genome assembly combines two very distinct genomes. To the current knowledge, the vast majority of Stipa species have 44 (2n = 4x) chromosomes and are supposed to be tetraploids 41,104 . In addition, recently it was shown that a single-copy region ACC1 and a low-copy nuclear gene Recently, it was suggested that the latter two are more plausible 41,104 . Thus, in order to better assemble the S. capillata genome and verify if Stipa is an allopolyploid genus we suggest sequencing at chromosome level the close relative diploid species (2n = 22) from genera representing, e.g. Ptilagrostis Griseb., Achnatherum P. Beauv., e.g. In general, the number of genes in Poaceae varies from 28,835 in the smallest known genome, Oropetium thomaeum Trin. (2n = 20; genome size of 245 Mb) 111 , to 107,891 in T. aestivum (2n = 42; 14,547 Mb) 112 . Here, we reported 53,535 nuclear genes that were structurally annotated for the masked genome assembly. Such a number of genes was roughly 1.8 and 1.6 times smaller than previously determined for S. grandis (94,674 genes) 54 and S. purpurea (84,298 genes) 50 , respectively. On the other hand, the annotation analysis of the unmasked genome resulted in 81,224 genes associated with already known proteins. In comparison, only 65,047 functionally annotated genes were reported for S. grandis while S. purpurea had 58,966. Nonetheless, as RNA-seq data is currently unavailable for S. capillata, we believe that the current version of the genome annotation demands a further investigation to properly characterise the genes sets when the appropriate information will be available.
SSR markers are widely distributed across the genome and they are commonly applied in establishing genetic structure in Stipa. Previously, polymorphic microsatellite primers were reported in populations of S. purpurea (11 113 , 15 114 and 29 115 loci), S. pennata (7 loci 116 ), S. breviflora (21 loci 117 ) and S. glareosa (9 loci 118 ). In the present study, we identified 77,614 perfect SSR markers (Supplementary File 3) and 58 imperfect repeat motifs presented only in a single copy (Supplementary Table S3). Although we did not test them on the population level we are confident that such a number of new loci will be a valuable source for the farther development of SSR markers in S. capillata, and more generally in the genus Stipa. Additionally, the revealed loci could be used for the designing dominant inter simple sequence repeat (ISSR) markers 119 . Recently, the usefulness of applying ISSRs were shown for studies in S. bungeana 120 , S. ucrainica and S. zalesskii 121 , S. tenacissima 122 and the hybrid complex S. heptapotamica 123 .
According to the previous studies, based on three chloroplast loci 124 and four chloroplast loci and one nuclear region 105 105 . Thus, the latter estimate is close enough to the origin-age calculated in the current study. In addition, our data on the divergence time among S. richteriana, S. lessingiana and S. heptapotamica ( Fig. 2 and Table 5) conforms to the previous findings on the ongoing hybridisation among these taxa 123 suggesting NORs as a useful tool for revealing species of putative hybrid origin. Nonetheless, we believe that the current and previous estimates regarding the origin of Stipa should be treated with caution. Firstly, to our knowledge, there is still no available fossil data for any Stipa species from the Old World that can properly calibrate the historical diversification in the genus. Currently, the earliest definite Stipa caryopses were found in central Poland and are dated ca. 4,000 BC 126 . Secondly, available data demonstrate incongruence between chloroplast and nuclear loci analyses. In further studies we suggest utilising single-copy nuclear genes derived from whole genome sequencing projects. Thirdly, different sets of species and parameters used for inferring diversification dates may result in different estimates 127 . Finally, we report a 137,823 bp chloroplast genome that is similar to the known assemblies in Stipa and specifically in S. capillata 57 . Here we highlight the applicability of a long-read sequencing technology like PacBio for the straightforward assembling of plastomes using Flye 67,68 . In addition, due to the long-reads we were able to identify two haplotypes presented in S. capillata. This result supports the previous findings in Poaceae 81 suggesting that plastome structural heteroplasmy can be a common state in feather grasses. Moreover, for the first time in the genus Stipa, here we present a 438,037 bp mitochondrial genome. The current size of this genome is close to Alloteropsis semialata (R.Br.) Hitchc. (442,063 bp) 128 , T. aestivum (452,526 bp) 129 , Sorghum bicolor L. (468,628 bp) 130 and A. speltoides (476,091 bp) 131 . Nevertheless, the present version of the genome is constituted by four contigs rather than one circular sequence. Although the general acceptance among mitochondrial biologists is that plant mitochondrial genomes have a variety of configurations [132][133][134] , in order to verify if a more accurate assembly could be performed, we suggest reusing our data for a more comprehensive analysis of the mitochondrial structures within Stipa.

Materials and methods
Plant material and DNA extraction. Our research complies with relevant institutional, national, and international guidelines and legislation. A S. capillata sample from Kochkor River Valley, central Kyrgyzstan (Supplementary Table S4), was selected for genome sequencing. The sample was stored in silica gel at ambient temperature until DNA extraction was performed. Total genomic DNA was isolated from dried leaves after a six-month storage period using a CTAB large-scale DNA extraction protocol (Supplementary information S1, described in Supplementary File 6). DNA extraction was performed by SNPsaurus (USA). In addition, we isolated DNA from dried leaves using a Genomic Mini AX Plant Kit (A&A Biotechnology, Poland). Subsequently, quality check, quantification and concentration adjustment were accomplished using a NanoDrop One (Thermo Scientific, USA) and agarose gel electrophoresis visualisation. The concentration of the sample was adjusted to 50 ng/μL. The purified DNA sample (1 μg) was sent to Diversity Arrays Technology Pty Ltd (Canberra, Australia) for sequencing and DArT marker identification. Moreover, to test the phylogenetic power of NORs in Stipa, we supplemented the study with five specimens of S. richteriana Kar. & Kir, three of S. lessingiana Trin. & Rupr., four of S. heptapotamica Golosk. and four of S. korshinskyi Roshev. (Supplementary Table S4). The isolation of genomic DNA was performed from dried leaf tissues using a modified CTAB method 135 . Library construction and sequencing. In total, 5 ug of S. capillata genomic DNA were used to construct a PacBio library according to the 20 kb PacBio template preparation protocol omitting a shearing step. The size selection cut-off was set at 15 kb. The library preparation followed by sequencing on three PacBio Sequel SMRT cells (Pacific Biosciences, Menlo Park, CA, USA) was carried out by SNPsaurus, LLC. Prior to the assembly, reads from each SMRT cell were inspected and quality metrics were calculated using SequelQC v.1.1.0 136 . A high-density assay using the DArT complexity reduction method for S. capillata was performed according to a previously reported procedure 137 .
For the rest of the specimens used in the current study, the quality control using a fluorometer (PerkinElmer Victor3, USA) and gel electrophoresis, library construction using a TruSeq Nano DNA Library kit (350 bp insert size; Illumina, USA) and sequencing using 100 bp paired-end reads on an Illumina HiSeq 2500 platform (Illumina, USA) were performed by Macrogen Inc. (South Korea).
Nuclear genome assembly and validation. The execution of this work involved using many software tools, whose versions, settings and parameters are described in Supplementary information S2 (available in Supplementary File 6). The de novo assembly of the PacBio data was performed using Flye v.2.4 65,66 . The draft assembly was cleaned by running BLASTn v.2.10.0 138 against the NCBI nucleotide database v.5, and subsequently sending each BLAST hit to the JGI taxonomy server (https:// taxon omy. jgi-psf. org/) with a downstream step of keeping only plant contigs. Thereafter, Qualimap v.2.2.2 139 was used to identify mean coverage for each contig. In the final assembly we kept only contigs with an average coverage of more than 10x. In addition, overrepresented contigs (> 60x) were BLASTed against the NCBI nucleotide database v.5 and sequences assigned to chloroplasts and mitochondria were removed.
Due to the final assembly performed with Flye v.2.4 being roughly twice bigger than an expected monoploid genome size of 589 Mb 93 , we accomplished an additional assembly with FALCON v.0.2.5 68 and applied Purge Haplotigs v1.1.1 69 to filter redundant sequences due to possible heterozygosity. The assemblies' statistics were analysed using assembly-stats v.1.0.1 140 . In addition, in order to assess the completeness of the genome assemblies, we investigated the presence of highly conserved orthologous genes using BUSCO v.4.0.6 141 . Genome-wide identification of microsatellite markers. The unmasked nuclear genome, chloroplast and mitochondrial genome assemblies were screened for perfect mono-, di-, tri-, tetra-, penta-and hexa-nucleotide repeat motifs using Krait v.1.3.3 75 . We applied the following criteria: mono-nucleotide repeat motifs contain at least 12 repeats, di-nucleotide repeat motifs contain at least seven repeats, tri-nucleotide repeat motifs contain at least five repeats, tetra-, penta-and hexa-nucleotide repeat motifs contain at least four repeats. In the next step, we BLASTed the resulting contigs against the NCBI nucleotide database v.5, and sequences assigned to mitochondria were kept. Then, the PacBio subreads were mapped onto the kept contigs using Mini-map2, and only uniquely mapped reads were retained by Samtools. A new de novo assembly of the 15.51 Mb data was performed using Flye. In order to check if the mitochondrial contigs obtained by Flye could be merged into larger scaffolds we applied Circlator v.1.5.5 170 . However, the resulting sequences were identical to the Flye contigs. In addition, we used Unicycler v.0.4.8 84 with reads that were mapped onto the Flye contigs as a reference.
Further, to detect all possible structural haplotypes of the chloroplast genome we applied Cp-hap 81 . Next, we mapped raw reads onto the resulting mitochondrial contigs and the chloroplast genomes to manually check in IGV v.2.8.6 80 if any potential SNPs or indels are present. Eventually, annotations of the final mitochondrial contigs of 438,037 bp and the chloroplast genomes of 137,823 bp were performed using Geneious Prime v.2021.1.1 (https:// www. genei ous. com) based on 85% and 95% similarities to the reference genomes of mitochondria and chloroplasts, respectively (Supplementary Table S5).
In Silico mapping of DArT marker sequences. Since the DArT markers are designed to target active regions of the genome 171 , here we use them to validate the completeness of the nuclear genome assembly and www.nature.com/scientificreports/ improve the accuracy of data filtering in further genomic studies on Stipa. Two data types, Silico and SNPs markers, were mapped to the nuclear genome using BLASTn v.2.10.0. As a query we used trimmed DArT sequences in a range of 29-69 bp with the percent identity values to the reference genome of 95% or greater and removing alignments below 95% of a query.

Data availability
The raw PacBio reads are available at NCBI Sequence Read Archive 172 . The final genome assemblies are deposited into NCBI Assembly database under the following Accession Numbers: nuclear assembly (JAGXJF000000000) 67 ; mitochondrion assembly, contig 1 (MZ161090) 76 , contig 2 (MZ161091) 77 , contig 3 (MZ161093) 78 and contig 4 (MZ161092) 79 ; chloroplast assemblies, haplotype A (MZ146999) 82 and haplotype B (MZ145043) 83 . The masked and the unmasked versions of the nuclear genome annotation are presented in the Supplementary File 1