Draft genome sequence of subterranean clover, a reference for genus Trifolium

Clovers (genus Trifolium) are widely cultivated across the world as forage legumes and make a large contribution to livestock feed production and soil improvement. Subterranean clover (T. subterraneum L.) is well suited for genomic and genetic studies as a reference species in the Trifolium genus, because it is an annual with a simple genome structure (autogamous and diploid), unlike the other economically important perennial forage clovers, red clover (T. pratense) and white clover (T. repens). This report represents the first draft genome sequence of subterranean clover. The 471.8 Mb assembled sequence covers 85.4% of the subterranean clover genome and contains 42,706 genes. Eight pseudomolecules of 401.1 Mb in length were constructed, based on a linkage map consisting of 35,341 SNPs. The comparative genomic analysis revealed that different clover chromosomes showed different degrees of conservation with other Papilionoideae species. These results provide a reference for genetic and genomic analyses in the genus Trifolium and new insights into evolutionary divergence in Papilionoideae species.


Results
Genome sequencing and Assembly. The Australian subterranean clover variety, cv. Daliak, was subjected to whole genome shotgun sequencing. Single-end (SE), 520-660 bp paired-end (PE) and 2 kb, 5 kb, 8 Table S1). A total of 6.8 Gb overlap fragment (OF) reads were created from 16.7 Gb MiSeq PE reads by COPE 13 (Supplementary Fig. S1). Together with the 2.8 Gb 454 reads, the 6.8 Gb OF reads were assembled by Newbler 2.7. The resultant number of contigs was 101,010, totaling 414.8 Mb in length (Supplementary Table S2). In parallel, 23.2 Gb HiSeq SE and PE reads were assembled by SOAPdenovo2 14 (kmer = 61) and GapCloser 1.10 (p = 31). The assembled 603,937 sequences, totaling 421.8 Mb in length, were merged with the 101,010 Newbler contigs by GAM-NGS 15 and 92,729 merged scaffolds were consequently generated. A total of 44,900 super-scaffolds were created by further scaffolding of the 92,729 sequences using SSPACE2.0 16 , giving a total of 125.3 Gb MP reads. After excluding contaminated sequences and those shorter than 300 bases, the remaining 27,228 sequences were subjected to subsequent analysis as a draft genome sequence, TSUd_r1.0 (Supplementary Table S2).

SNP map and pseudomolecule construction.
A SNP linkage map was constructed with the F 2 population 92S80, generated from a cross between the subterranean clover cultivars Woogenellup and Daliak 17 . A total of 899,064 candidate SNPs were identified by mapping the 25.8 Gb Woogenellup and 337.9 Gb F 2 HiSeq 1000 PE reads onto the TSUd_r1.0 sequences, and 52,635 SNPs were subsequently selected for Axiom ® myDesign TM TG Array. Based on the genotypes of the 155 F 2 progenies obtained from the array, a SNP linkage map was constructed with 35,341 SNPs (Supplementary Table S3 and http://clovergarden.jp/). The mapped SNPs generated a total of 2,153 bins that were randomly located on eight linkage groups totaling 2,084 cM in length (Fig. 1, Supplementary Table S3).
A total of 1,505 TSUd_r1.0 scaffolds were anchored onto the linkage map through the mapped SNPs. Of the anchored scaffolds, 158 were located in multiple positions that suggested mis-assembly. Hence, the 158 scaffolds were split into 355 scaffolds according to the SNP positions on the linkage map. In addition, reverse complement sequences were created as revised scaffold sequences when the original sequences were anchored on the linkage map in a reverse direction. The modified TSUd_r1.0 sequences were designated as TSUd_r1.  subterranean clover genome (552.4 Mb), which was estimated based on kmer frequency distribution (kmer = 17, Supplementary Fig. S2). Assembly quality of the TSUd_r1.1 sequences was investigated by mapping assembled sequences onto the 248 core eukaryotic genes (CEGs) by CEGMA 18 . The numbers of 'complete' and 'complete or partial' mapped CEGs were 237 (95.6%) and 243 (98.0%), respectively, indicating high reliability of the assembled sequences.
The total number and length of anchored TSUd_r1.1 scaffolds on the SNP map were 1,702 and 384,208,136 bp (81.4% of TSUd_r1.1), respectively. Eight pseudomolecules, chr1-chr8, were constructed by connecting the anchored TSUd_r1.1 scaffolds with insertions of 10,000 N bases in between adjacent scaffolds ( Fig. 1 Table S5). Of the known types of repeats, Class I LTR (Long Terminal Repeat) elements were observed most frequently. The ratios of repetitive sequences of the eight pseudomolecules ranged from 36.1% (chr7) to 46.0% Mb (chr6). A higher ratio of repetitive sequences was observed in chr0 (scaffolds that were not in pseudomolecules). Of the unique repeats observed on the eight pseudomolecules, 67.4 Mb (62.8%) sequences were commonly observed in at least one of the other four legume species (red clover, M. truncatula, L. japonicus and common bean), while the other 39.8 Mb (37.2%) were subterranean clover-specific (Supplementary Table S6). A total of 61,402 SSR sequences were identified in TSUd_r1.1 (Supplementary Table S7). The average frequency of SSRs in overall and coding sequence (CDS) was 12.4 and 5.1 per 100 kb, respectively. The SSR frequency in CDS was lower than in the other four legume species.
Gene prediction and annotation. Gene prediction of TSUd_r1.1, employing MAKER 20 , yielded a total 28,372 genes. Moreover, a further 14,334 genes were additionally predicted by Augustus 21 , giving a total of 42,706 predicted genes with average coding sequence length of 1,123 bp (Supplementary Table S8). In addition, 1,007 tRNA and 92 rRNA encoding genes were identified (Supplementary Table S9). Among the 42,706 putative genes, 37,085 (86.8%) were classified as non-TE genes whereas 5,621 (13.2%) were transposon elements (TEs), based on BLAST and domain searches against NCBI NR and InterPro 22 databases, respectively. For comparisons with the gene sequences in other legume species, 36,800 subterranean clover putative non-TE genes were clustered with the genes predicted in red clover, M. truncatula, L. japonicus, and common bean. This generated a total of 30,048 clusters. The number of subterranean clover-specific clusters was 14,745, whereas that commonly observed in genus Trifolium (subterranean and red clovers), tribe Trifolieae (clovers and M. truncatula) and clade Hologalegina (clovers, M. truncatula and L. japonicus) was 1,906, 2,402 and 1,749, respectively (Fig. 2). A total of 6,388 (21.3%) clusters were observed in all five legume species ( Supplementary Fig. S3).
The 36,800 putative genes were further annotated using GO 23 , KOG 24 and KEGG 25 databases. A total of 30,543 (83.0%) genes were annotated with GO categories, including 10,328 genes involved in biological processes, 4,400 genes coding for cellular components and 15,815 genes associated with molecular function (Supplementary Fig. S4). Of the 16,995 subterranean clover-specific genes (See Fig. 2), only 6,449 (38%) genes were annotated while most genes (12,351, 88%) in the 'Other' cluster were classified in the GO categories. This result indicates that many of the subterranean clover-specific genes are novel with unknown function. Meanwhile, a total of 31,335 putative genes showed significant similarity to genes in the KOG database and 3,520 (11.2%), 6,004 (19.2%) and 5,165 (16.5%) genes were annotated in 'information storage and processing' , 'cellular processes and signaling' and 'metabolism' , respectively ( Supplementary Fig. S5). Genes belonging to the tribe Trifoliae cluster showed a higher ratio of 'poorly characterized' genes. The putative genes were also mapped onto a total of 1,688 enzyme encoding genes on KEGG pathways categorized '1. Metabolism' (Supplementary Table S10).
Whole structure and variance in the genome of subterranean clover. When the distribution of repetitive and putative gene sequences were surveyed at a whole genome scale, larger ratios of repetitive sequences and a smaller number of genes were observed in midsections of chr1, chr3, chr4, chr6, chr7 and chr8, suggesting heterochromatin regions (Fig. 3A). However, candidate heterochromatin regions in chr2 and chr5 were insufficiently resolved to observe clearly. The ratios of common genes among the five legume species were lower in chr3, chr6 and chr8 than the other pseudomolecules (Fig. 3B). Higher ratios of subterranean clover-specific genes tended to be observed in euchromatin regions, with gene density being higher in chr1 and chr7 and lower in chr3 and chr6. Copy number variations (CNVs) in the genomes of 'Daliak' and 'Woogenellup' were detected using the CNV-seq program 26 . Higher log2 ratio (log-transformed ratio of the number of mapped reads in Woogenellup to the number in Daliak) were observed in the regions where densities of repetitive sequences were high (Fig. 3C).  The functions of the 35,341 SNPs mapped on the pseudomolecule on gene functions were predicted using SnpEff 27 , which groups SNPs into four categories (high-impact, moderate, modifier, and low), based on their positions in the genome sequences ( Supplementary Fig. S6). Among 36,723 annotations on the 35,341 SNPs, 287 (0.8%) were classified as high-impact SNPs, including "splice acceptor and donor variants", "loss of the start codon", or "gain/loss of the stop codon". Another 4,894 loci (13.3%) were classified as "moderate effects" (mis-sense variants), 26,107 (71.1%) as "modifiers" (e.g. variants in intergenic regions and introns), and 5,435 (14.8%) as "low-impact" (e.g. synonymous variants, Supplementary Table S11).
Comparative analysis with other legume species. A similarity search was performed, based on comparisons with the translated protein sequences of red clover, M. truncatula, L. japonicus and common bean ( Fig. 4A and Supplementary Fig. S7). Alignment of homologous sequence pairs along each pseudomolecule (Tsud_chr) revealed obvious syntenic relationships with chromosomes in M. truncatula (Mt_chr). Single synteny blocks were observed for three pseudomolecules (Tsud_chr1-M. truncatula chromosome 1(Mt_chr1), Tsud_ chr3-Mt_chr3 and Tsud_chr5-Mt_chr5). Tsud_chr, 6, 7 and 8 shared two synteny blocks with Mt_chr3-chr6, Mt_chr3-chr6 and Mt_chr2-Chr8, respectively. The remaining two pseudomolecules shared three synteny blocks with Mt-chrs. Duplicated synteny blocks were observed in Tsud_chr2, chr4, chr5 and chr6. By contrast, clear synteny blocks were not shown between subterranean and red clovers. This might be caused by the shorter pseudomolecule (164.2 Mb of 309 Mb assembled genome) and larger ratio of unknown amino acid sequences in red clover (Percentages of X in translated protein sequences was 1.48% in red clover and 0.02% in subterranean clover). Nevertheless, possible synteny blocks were observed between Tsud_chr1-red clover linkage group 1 (Rc_LG1), Tsud_chr2-Rc_LG3, Tsud_chr5-Rc_LG2, Tsud_chr6-RC_LG2-LG4 and Tsud_chr7-RC_LG6-LG7. Clear synteny blocks were also observed between the subterranean clover genome and the other legume species. However, the blocks were more fragmented when compared with both L. japonicus and common bean, than when compared against M. truncatula. Overall, Tsud_chr1 is more conserved across the five legume species than the other pseudomolecules.
Fifty-four percent of gene pairs between subterranean and red clovers showed a synonymous-substitution rate (Ks value) less than 0.1, suggesting a close relationship between the two clovers (Fig. 4B). A similar distribution of Ks value was observed between M. truncatula and the two clovers, although the basic chromosome numbers of the two clovers differ (subterranean clover n = 8, red clover n = 7). Interestingly, the peaks of Ks value between L. japonicus and common bean were almost the same as that between the three tribe Trifoleae species and L. japonicus.
To infer the divergence time among Arabidopsis thaliana and the five legume species (common bean, L. japonicus, M. truncatula, red clover and subterranean clover), a phylogenetic tree was constructed based on 280 single copy genes commonly observed across the six species. According to the phylogenetic tree, it was estimated that the Hologalegina and Millettioid clades diverged ~54.8 MYA (Fig. 4C, Supplementary Fig. S8), similar to the study of Lavin et al. 28 . The divergence time between Robinioid (including L japonicus) and IRLC (Inverted Repeat Lacking Clade, including tribe Trifoliae) clades was ~42.7 MYA, while the genera Trifolium and Medicago diverged ~19.0 MYA. These two time estimations were more recent than the previous study of ~47.6 MYA and ~23 MYA, respectively 9 , which could be due to the different number of single copy genes used in the studies (280 in this manuscript and 818 in de Vega et al. 9 ). The divergence time between subterranean clover and red clover was estimated as 7.45 MYA.

Discussion
We constructed a 471.8 Mb subterranean clover draft genome, including eight pseudomolecules, totaling 401.1 Mb. This is the first draft genome of an annual forage clover. The assembled sequence covers 85.4% of the subterranean clover genome and is of high quality, according to CEGMA. Based on the 42,706 predicted genes, 14,745 subterranean clover-specific gene clusters were generated. These gene clusters will be a source for discovery of unique and potentially valuable genes in the species for a range of traits, including those leading to increased biomass production and persistence, disease and pest resistance, increased phosphorous-use efficiency and reduced methane emissions from grazing ruminant livestock 2 . Moreover, the high density linkage map, consisting of 35,341 SNPs and 287 high-impact SNPs annotated by SnpEff, will also greatly advance genetic and genomic analyses of subterranean clover.
Comparative analysis with the other legume species provided an insight into the evolution of Papilionoideae species. Synteny analysis revealed that the whole chromosome structure of chr1 in subterranean clover was highly conserved across the five Papilionoideae species compared and markedly more conserved than the other chromosomes. By contrast, duplicate synteny blocks were observed in subterranean clover chr2 against M. truncatula and red clover, suggesting that the genome duplication in chr2 occurred after the divergence of red and subterranean clovers ~7.45 MYA.
It is difficult to obtain high quality draft genomes in many of the most important forage legume species. The perennial species, red clover, white clover and alfalfa, show strong self-incompatibility, making it difficult to produce homozygous plants for genetic and genomic analysis. Even though a draft genome for red clover has been published 9 , the quality of the subterranean clover genome in this study is much higher. For example, the percentages of complete and partial mapped CEGs by CEGMA in red clover were 85.5% and 95.6%, respectively, whereas those in subterranean clover were 95.6% and 98.0%. In addition, white clover and alfalfa are polyploid species. Subterranean clover is much better suited than these species for genomic and genetic studies because it is an annual, autogamous and diploid species. the results obtained in this study are expected to be a reference for genetic and genomic analyses within the Trifolium genus and also among other forage legumes. In the case of subterranean clover, a genome scaffold, a high resolution QTL map, the development of a world core collection and extensive phenotyping for important agro-morphological and economic traits have generated sufficient molecular genetic information to establish a comprehensive molecular breeding platform.

Materials and Methods
Genome sequencing and Assembly. An Australian subterranean clover variety, cv. Daliak, was subjected to whole-genome shotgun sequencing using the Roche 454 GS FLX+ , Illumina HiSeq 2000 and MiSeq platforms. Total cellular DNA was used for construction of a SE library for Roche 454 and PE libraries for Illumina HiSeq 2000 and MiSeq sequencing platforms according to the manufacturer's instructions. A modified protocol proposed by Nieuwerburgh et al. 29 was employed for MP library preparation. The obtained reads were subjected to quality control, as follows. Bases with quality scores less than 10 were filtered by PRINSEQ 0.20.4 30 and adaptor sequences in the reads were trimmed using fastx_clipper from the FASTX-Toolkit 0.0.13 (http://hannonlab.cshl. edu/fastx_toolkit).
Gene prediction and annotation. tRNA genes were predicted using tRNAscan-SE ver. 1.23 36 with default parameters. rRNA genes were predicted by BLAST searches with an E-value cutoff of 1E-10. The A. thaliana 5.8S and 25S rRNAs (accession number: X52320.1) and 18S rRNA (accession number: X16077.1) were used as query sequences.
Protein-encoding sequences in the assembled genomic sequences were predicted by MAKER 20 . Published SRA transcript sequences of red clover (SRX351919, SRX351918, SRX351917, SRX351791) and white clover (ERX324290) were assembled by Trinity 37 for evidence-based gene prediction. The assembled 431,700 red clover and 217,133 white clover unigenes were mapped onto the subterranean pseudomolecules by BLAT (-t = dnax -q = dnax minIdentity = 25), together with the 62,319 M. truncatula CDS (4.0 v1). In parallel, Ab initio gene prediction was performed by Augustus 21 using the A. thaliana training set. The two sets of predicted gene sequences were merged by MAKER. Genes related to transposable elements (TEs) were detected by BLASTP searches against the NCBI non-redundant (nr) protein database (http://www.ncbi.nlm.nih.gov) with an E-value cutoff of 1E-10, and InterProScan 38 searches against the InterPro database 22 with an E-value cutoff of 1.0. The putative non-TE genes were classified into the plant gene ontology (GO) slim categories 23 , and the "euKaryotic clusters of Orthologous Groups" (KOG) categories 24 and then mapped onto the Kyoto Encyclopedia of Genes and Genomes (KEGG) reference pathways 25 .
Comparative analysis. Translated protein sequences of subterranean clover were clustered by using the CD-HIT program 39 with those of common bean (v1.0) 5 , L. japonicus (rel.3.0) 10 , M. truncatula (4.0 v1) 11 , and red clover (v2.1) 9 with the parameters c = 0.6 and aL = 0.5. Homologous translated proteins sequences were searched by BLAST searches with an E-value cut-off of 1E-100, and the synteny plot was made using the gnuplot program (http://www.gnuplot.info). Ks value was estimated for the gene pairs with reciprocal best hits of BLAST searches with an E-value cut-off of 1E-10, using KaKs Calculator 40 .
Phylogenetic analysis. CD-HIT analysis against the six species (A. thaliana (TAIR10), common bean (v1.0), L. japonicus (rel.3.0), M. truncatula (4.0 v1), red clover (v2.1), and subterranean clover (TSUd_r1.0)) were conducted with the parameters c = 0.6 and aL = 0.9. The single copy genes in each cluster commonly conserved among the six species were applied to multiple alignment by MUSCLE 3.8.31 41 . The indels in the aligned sequences of the single copy genes were eliminated by Gblocks 0.91b 42 . The sequences of conserved blocks in the single copy genes were concatenated for each species and were used for construction of a phylogenetic tree by the Maximum-Likelihood (ML) algorithm, using MEGA 7.0.9 beta 43 with the Jones-Taylor-Thornton (JJT) model as substitution model. The time values were calculated according to the distances in the phylogenetic tree constructed by Maximum Likelihood method. The uniform rates defined in the MEGA program were used for converting the distances to time values. In this step, A. thaliana was selected as outgroup. According to TIMETREE (http://www.timetree.org), the divergence time between A. thaliana and M. truncatula is 114.0 MYA, and this value was used for calibration. The divergence times among the six species were calculated by using MEGA 7.0.9 beta.