Chromosome-scale genome assembly of bread wheat’s wild relative Triticum timopheevii

Wheat (Triticum aestivum) is one of the most important food crops with an urgent need for increase in its production to feed the growing world. Triticum timopheevii (2n = 4x = 28) is an allotetraploid wheat wild relative species containing the At and G genomes that has been exploited in many pre-breeding programmes for wheat improvement. In this study, we report the generation of a chromosome-scale reference genome assembly of T. timopheevii accession PI 94760 based on PacBio HiFi reads and chromosome conformation capture (Hi-C). The assembly comprised a total size of 9.35 Gb, featuring a contig N50 of 42.4 Mb and included the mitochondrial and plastid genome sequences. Genome annotation predicted 166,325 gene models including 70,365 genes with high confidence. DNA methylation analysis showed that the G genome had on average more methylated bases than the At genome. In summary, the T. timopheevii genome assembly provides a valuable resource for genome-informed discovery of agronomically important genes for food security.

Triticum timopheevii ssp.timopheevii has been exploited in various studies for wheat improvement as it has been shown to be an abundant source for genetic variation for many traits such as resistance to leaf rust [11][12][13] , stem rust [14][15][16] , powdery mildew [16][17][18] , Fusarium head blight 19,20 Hessian fly, Septoria blotch, wheat curl mite and tan spot 21 .It has also been shown to have tolerance to abiotic stresses such as salinity 22,23 and be a good source for traits affecting grain quality such as milling yield and grain protein 24 and grain mineral content 25 .During sequence analysis of reference quality assemblies (RQA) of 10 wheat cultivars, recent studies found two of them, cv.LongReach Lancer and cv.Julius, contained major introgressions on Chr2B (among others) potentially originating from T. timopheevii 26,27 .Introgressions from T. timopheevii have also been found in many other wheat accessions present in genebank collections 28 .Pre-breeding programmes involving the introgression of the whole genome of T. timopheevii, in small segments, into bread wheat 10,29 with diagnostic KASP markers that can track these introgressions in wheat 29,30 have provided promising new germplasm and tools to the wheat research community.
In this study, we report a chromosome-scale reference genome sequence assembly for T. timopheevii by integrating chromatin conformation capture (Hi-C) derived short-reads 31 with PacBio HiFi long-reads 32 .The assembly was annotated for gene models and repeats.CpG methylation along the chromosomes was inferred from the PacBio CCS data.The high-quality T. timopheevii genome assembly obtained in this study provides a reference for the G genome of the Triticum genus.This new resource will form the basis to study chromosome rearrangements across different Triticeae species and will be explored to detect A t and G genome introgressions in durum and bread wheat allowing future genome-informed gene discoveries for various agronomic traits.

Methods
Plant material, nucleic acid extraction and sequencing.High molecular weight (HMW) DNA was extracted from a young seedling (dark-treated for 48 hours) of T. timopheevii accession PI 94760 (United States National Plant Germplasm System, NPGS available at https://npgsweb.ars-grin.gov/gringlobal/search)using a modified Qiagen Genomic DNA extraction protocol (https://doi.org/10.17504/protocols.io.bafmibk6) 33 .DNA was sheared to the appropriate size range (15-20 kb) and PacBio HiFi sequencing libraries were constructed by Novogene (UK) Company Limited.Sequencing was performed on 10 SMRT cells of the PacBio Sequel II system in CCS mode with kinetics option to generate ~267 Gb (~28-fold coverage) of long HiFi reads (Supplementary Table S1).Four Hi-C libraries were prepared using leaf samples (from the same plant used for HMW DNA extraction), at Phase Genomics (Seattle, USA) using the Proximo ® Hi-C Kit for plant tissues according to the manufac- turer's protocol.The Hi-C libraries were sequenced on an Illumina NovaSeq 6000 S4 platform to generate ~2.8 billion of paired end 150 bp reads (~842 Gb raw data; ~89-fold coverage; Supplementary Table S2).
Total RNA was extracted from seedlings (3-leaf stage), seedlings at dusk, roots, flag leaves, spikes and grains.Flag leaf and whole spike were collected at 7 days post-anthesis and whole grains were collected at 15 days post-anthesis.In brief, 100 mg of ground powder from each tissue was used for RNA isolation using the RNeasy Plant Mini Kit (#74904, QIAGEN Ltd UK).The RNA samples were split into 2 aliquots, one for mRNA sequencing (RNA-Seq) and one for Iso-Seq 34 .Library construction for both types of sequencing was carried out by Novogene (UK) Company Limited.Illumina NovaSeq 6000 S4 platform was used for mRNA sequencing to generate on average 450 million reads (~67 Gb of 2 × 150 bp reads) for each sample (Supplementary Table S3).The second set of RNA aliquots from each of the six tissues were pooled into one sample and sequenced on the PacBio Sequel II system using the Iso-Seq pipeline to generate 4.47 Gb of Iso-Seq data (Supplementary Table S4a) which was analysed using PacBio Iso-Seq analysis pipeline (SMRT Link v12.0.0.177059).
Plants were grown in a glasshouse in 2 L pots containing John Innes No. 2 soil and maintained at 18-25 °C under 16 h light and 8 h dark conditions.All sequencing was carried out by Novogene (UK) Company Limited.
Chromosome-scale genome assembly.The cleaned HiFi reads were assembled into the initial set of contigs using hifiasm v.0.16.1 40 with default parameters and the dataset was assessed using gfastats v1.3.1 41 .The contig assembly had a total size of ~9.41 Gb, with a contig N50 value of 43.12 Mb.Genome completeness was assessed for the contig assembly using the Benchmarking Universal Single-Copy Orthologs (BUSCO v5.3.2) 42 program with the embryophyta_odb10 database which yielded 99% of the complete BUSCO genes.Contaminants (contigs other than those categorised as Streptophyta or no hit) were identified using BlobTools v1.1.1 43 and removed.
To achieve chromosome-level assembly, the trimmed Hi-C data 44 was mapped onto the decontaminated contig assembly using the Arima Genomics ® mapping pipeline (available at https://github.com/ArimaGenomics/mapping_pipeline) and chromosome construction was conducted using the Salsa2 45 pipeline (available at https:// github.com/marbl/SALSA)with default parameters and GATC as the cutting site for the restriction enzyme (DpnII).The Hi-C contact map for the scaffold assembly was constructed using PretextMap v0.1.9and the chromatin contact matrix was manually corrected using PretextView v0.2.5 by following the Rapid Curation pipeline 46 (https://gitlab.com/wtsi-grit/rapid-curation).The curated assembly was assessed using gfastats to consist of 14 pseudomolecules and 1656 unplaced scaffolds with a total length of 9,350,839,849 bp (including gaps) and a contig N50 of 42.4 Mb (Table 1).The orientation and the chromosome name of each pseudomolecule were determined based on homology with the wheat cv.Chinese Spring assembly RefSeq 2.1 47 A and B subgenomes, using dotplot comparison of sequence alignments produced by MUMmer's (v3.23 48 ) nucmer aligner and viewed on Dot (available at https://github.com/marianattestad/dot).The pseudomolecules were thus, renamed into the 14 T. timopheevii chromosomes, seven A t genome chromosomes with a total length of ~4.85 Gb and consisting of 119 contigs and seven G genome chromosomes with a total length of ~4.40 Gb and consisting of 529 contigs (Table 2).
organellar genome assembly.De novo assembly of the organelle genomes was carried out using the Oatk pipeline (available at https://github.com/c-zhou/oatk)with the cleaned HiFi reads.The circular chloroplast and mitochondrial contigs were assembled with a total size of 136,158 bp and 443,464 bp, respectively.Any unanchored contigs that aligned to these extranuclear genomes were removed from the final assembly.using a curated set of high confidence T. aestivum coding genes to hard mask the RepeatModeler library; transposable element genes were first excluded from the T. aestivum coding gene set by running Trans-posonPSI (r08222010).Unclassified repeats were searched in a custom BLAST database of organellar genomes (mitochondrial and chloroplast sequences from Triticum in the NCBI nucleotide division).Any repeat families matching organellar DNA were also hard-masked.Repeat identification was completed by running RepeatMasker v4.0.72 with a RepBase embryophyte library and with the customized RepeatModeler library (i.e. after masking out protein coding genes), both using the -nolow setting.Overall, 78.43% of the assembly was classified as repetitive sequences (Table 3).The consolidated set of repeat features (i.e.RepeatMasker outputs from the embryophyte and customized RepeatModeler libraries) were given as input to the evidence guided gene prediction (REAT prediction) and gene model consolidation (Minos) steps.All other annotation steps utilised the unmasked genome.

Reference guided transcriptome reconstruction
Gene models were derived from the RNA-Seq reads, Iso-Seq transcripts (122,253 high quality and 82 low quality isoforms; Supplementary Table S4b) and Full-Length Non-Concatamer Reads (FLNC) using the REAT transcriptome workflow.HISAT2 v2.2.1 51 was selected as the short read aligner with Iso-Seq transcripts aligned with minimap2 v2.18-r1015 52 , maximum intron length was set as 50,000 bp and minimum intron length to 20 bp.Iso-Seq alignments were required to meet 95% coverage and 90% identity.High-confidence splice junctions were identified by Portcullis v 1.2.4 53 .RNA-Seq Illumina reads were assembled for each tissue with StringTie2 v2.1.5 54and Scallop v0.10.5 55 , while FLNC reads were assembled using StringTie2 (Supplementary Table S5).Gene models were derived from the RNA-Seq assemblies and Iso-Seq and FLNC alignments with Mikado.Mikado was run with all Scallop, StringTie2, Iso-Seq and FLNC alignments and a second run with only Iso-Seq and FLNC alignments (Supplementary Table S6).

Evidence guided gene prediction
The evidence guided annotation of protein coding genes based on repeats, RNA-Seq mappings, transcript assembly and alignment of protein sequences was created using the REAT prediction workflow.The pipeline has four main steps: (1) the REAT transcriptome and homology Mikado models are categorised based on alignments to UniProt proteins to identify models with likely full-length CDS and which meet basic structural checks i.e., having complete but not excessively long UTRs and not exceeding a minimum CDS/cDNA ratio.A subset of gene models is then selected from the classified models and used to train the AUGUSTUS gene predictor 58 ; (2) Augustus is run in both ab initio mode and with extrinsic evidence generated in the REAT transcriptome and homology runs (repeats, protein alignments, RNA-Seq alignments, splice junctions, categorised Mikado models).Three evidence guided AUGUSTUS predictions are created using alternative bonus scores and priority based on evidence type; (3) AUGUSTUS models, REAT transcriptome/homology models, protein and transcriptome alignments are provided to EVidenceModeler 59 (EVM) to generate consensus gene structures; (4) EVM models are processed through Mikado to add UTR features and splice variants.

Projection of gene models from Triticum aestivum
A reference set of hexaploid wheat gene models was derived from public gene sets (IWGSC 60 and 10+ wheat 26 ) projected onto the IWGSC RefSeq v1.0 assembly 60 ; a filtered and consolidated set of models was derived with Minos, with a primary model defined for each gene.Models were scored on a combination of intrinsic gene structure characteristics, evidence support (protein and transcriptome data) and consistency in gene structure across the input gene models.The Minos primary models were classified as full-length or partial based on alignment to a filtered magnoliopsida Swiss-Prot TrEMBL database.This assignment, together with criteria for gene structure characteristics and the original confidence classification, was used to classify models into 6 categories (Platinum, Gold, Silver, Bronze, Stone and Paper), with Platinum being the highest confidence category for models assessed as full-length, with an original confidence classification of "high", meeting structural checks for number of UTR and CDS/cDNA ratio and which were assessed as consistently annotated across the input gene sets.Reclassification resulted in 55,319 Platinum, 24,789 Gold, 11,968 Silver, 61,845 Bronze, 110,518 Stone and 115,336 Paper genes.The four highest confidence categories Platinum, Gold, Silver and Bronze were projected onto the T. timopheevii assembly with Liftoff v1.5.1 61 , only those models transferred fully with no loss of bases and identical exon/intron structure were retained (https://github.com/lucventurini/ei-liftover).Similarly, high confidence genes annotated in the hexaploid wheat cv.Chinese Spring RefSeq v2.1 assembly 47 were projected onto the T. timopheevii genome assembly with Liftoff, and only those models transferred fully with no loss of bases and identical exon/intron structure were retained.Among these, gene models with the attribute "manually_curated" in the original Refseq v2.1 assembly were extracted as a set.

Gene model consolidation
The final set of gene models was selected using Minos (Table 4).Minos is a pipeline that generates and utilises metrics derived from protein, transcript, and expression data sets to create a consolidated set of gene models.In this annotation, the following gene models were filtered and consolidated into a single set of gene models using Minos: 1.The three alternative evidence guided Augustus gene builds described earlier.
2. The gene models derived from the REAT transcriptome runs described earlier.
3. The gene models derived from the REAT homology runs described earlier.
4. The gene models derived from the REAT prediction run (AUGUSTUS and EVM-Mikado) described earlier.
5. The gene models derived from projecting public and curated T. aestivum gene models of varying confidence levels onto the T. timopheevii genome as described earlier.6. IWGSC Refseq v2.1 models identified as "manually_curated" projected onto the T. timopheevii genome as described earlier.
Gene models were classified as biotypes protein_coding_gene, predicted_gene and transposable_element_ gene, and assigned as high or low confidence (Table 5) based on the criteria below: a. High confidence (HC) protein_coding_gene: Any protein coding gene where any of its associated gene models have a BUSCO v5.4.7 62 protein status of Complete/Duplicated OR have diamond v0.9.36 coverage (average across query and target coverage) >= 90% against the listed Poaceae protein datasets (Supplementary Table S7) or UniProt magnoliopsida proteins.Or alternatively have average blastp coverage (across query and target coverage) >= 80% against the listed protein datasets/UniProt magnoliopsida AND have transcript alignment F1 score (average across nucleotide, exon and junction F1 scores based on RNA-Seq transcript assemblies) >= 60%.
b. Low confidence (LC) protein_coding_gene: Any protein coding gene where all its associated transcript models do not meet the criteria to be considered as high confidence protein coding transcripts.c.HC transposable_element_gene: Any protein coding gene where any of its associated gene models have coverage >= 40% against the combined interspersed repeats (see section 1).d. LC transposable_element_gene: Any protein coding gene where all its associated transcript models do not meet the criteria to be considered as high confidence and assigned as a transposable_element_gene (see c). e. LC predicted_gene: Any protein coding gene where all the associated transcript models do not meet the criteria to be considered as high confidence protein coding transcripts.In addition, where any of the associated gene models have average blastp coverage (across query and target coverage) <30% against the listed protein datasets AND having a protein-coding potential score <0.25 calculated using CPC2 0.1 63 .f. LC ncRNA gene: Any gene model with no CDS features AND a protein-coding potential score <0.3 calculated using CPC2 0.1.g.Discarded models: Any models having no BUSCO protein hit AND a protein alignment score (average across nucleotide, exon and junction F1 scores based on protein alignments) <0.2 AND a transcript alignment F1 score (average across nucleotide, exon and junction F1 scores based on RNA-Seq transcript assemblies) <0.2 AND a diamond coverage (target coverage) <0.3 AND Kallisto v0.44 64 expression score <0.2 from across RNA-Seq reads OR having short CDS <30 bps.Any ncRNA genes (no CDS features) not meeting the ncRNA gene requirements (f) were also excluded.
Gene model distribution across the pseudomolecules and unplaced scaffolds is shown in Table 2 and gene density of 164,617 protein coding genes across the T. timopheevii genome was calculated using deepStats v0.4 65 in 10 Mb bins (Fig. 2b).

Functional annotation
All proteins were annotated using AHRD v.    Since T. timopheevii is known as an important source for genetic variation for resistance against major diseases of wheat as described above and as the majority of cloned disease-resistance genes encode nucleotide-binding leucine-rich repeats (NLRs) 68,69 , we validated the annotation of all gene models annotated as NB-ARC domain-containing/disease resistance proteins in the genome assembly (2399 gene models) using NLR-Annotator v2 70 and found an additional 166 NLRs (total 2565).We plotted the genomic distribution of the larger set (Fig. 2c), by calculating the density in 10 Mb bins using deepStats v0.4,which shows concentration of these NLRs at mostly distal ends of the chromosomes of Triticum timopheevii.
Generation of PacBio DNA methylation profile.Methylation in CpG context was inferred with ccsmeth v0.3.2 71 , using the kinetics data from PacBio CCS subreads obtained during HMW DNA sequencing.The methylation prediction for CCS reads were called using the model "model_ccsmeth_5mCpG_call_ mods_attbigru2s_b21.v2.ckpt".The reads with the MM + ML tags were aligned to the pseudomolecules in the T. timopheevii assembly using BWA v0.7.17 72 .The methylation frequency was calculated at genome level with the modbam files and the aggregate mode of ccsmeth with the model "model_ccsmeth_5mCpG_aggregate_ attbigru_b11.v2p.ckpt".The genomic distribution of 5mC modifications across T. timopheevii (Fig. 2d) shows that G genome chromosomes have more methylation with an average of ~401.8Kbp methylated bases per 10 Mb bin as compared to the A t genome chromosomes with an average of ~385.5 Kbp per 10 Mb bin (calculated using deepStats v0.4).

Data records
The raw sequence files for the HiFi, Hi-C, RNA-Seq and IsoSeq reads were deposited in the European Nucleotide Archive (ENA) under accession number PRJEB71660 73 .The final chromosome-scale assembly consisting of the nuclear and organelle genomes was deposited at ENA under the accession number GCA_963921465.1 74 .
The genome assemblies, gene model and repeat annotations, methylation profile and Hi-C contact map are also available at on DRYAD Digital Repository 75 .
technical Validation assessment of genome assembly and annotation.The final curated assembly was assessed by mapping the trimmed Hi-C reads to the post-curated assembly (as described above for scaffolding) and generating a final Hi-C contact map using PretextMap v0.1.9and viewed using PretextView v0.2.5.It showed a dense dark blue pattern along the diagonal revealing no potential mis-assemblies (Fig. 3).The anti-diagonals in the Hi-C contact matrix were expected and have been reported for other relatively large plant genomes such as those from the Triticeae tribe 76,77 as they correspond to the typical Rabl configuration of Triticeae chromosomes 78,79 .
The BUSCO v5.3.2 42 (-l poales_odb10) score of 99.1% (0.1% fragmented and 0.8% missing BUSCOs; Supplementary Table S8a) at the genome level indicates a high completeness of the T. timopheevii assembly.The quality of the T. timopheevii assembly was assessed with Merqury 80 based on the PacBio HiFi reads using 31-mers.The QV (consensus quality value) and k-mer completeness scores were 65.5 and 97.8%, respectively.The quality of the assembly was further assessed by determining the LTR Assembly Index (LAI) and attainment of a value of 13.62 suggests that the assembly meets the criteria for a reference quality genome 81 indicating a high level of accuracy and completeness in capturing genomic features, particularly those related to LTR retrotransposons.
Completeness of the gene model prediction was also evaluated using BUSCO (-l poales_odb10) and produced a score of 99.9% (0.0% fragmented and 0.1% missing BUSCOs; Supplementary Table S8b).The number of HC gene models (70,365) is in the range of a tetraploid Triticeae species (34,000-43,000 high-confidence gene models per haploid genome) 82 .

Usage Notes
A genome browser for the assembly of T. timopheevii generated in this study is currently being hosted at GrainGenes 83 (https://wheat.pw.usda.gov/jb?data=/ggds/whe-timopheevii) with tracks for annotated gene models and repeats and BLAST functionality available at https://wheat.pw.usda.gov/blast/.

Fig. 2
Fig. 2 Circos plot 84 of features of the chromosome-scale assembly of T. timopheevii showing (a) major translocations with the T. timopheevii genome as observed through collinearity analysis against T. turgidum, (b) gene density (of all gene models), (c) NLR density (max count 87), (d) DNA methylation (5mC modification) density, (e) distribution of KASP markers based on SNPs with bread wheat cv.Chinese Spring 29 and (f) GC content (in %).Tt in chromosome name represents T. timopheevii.Y-axis for tracks c and f have an interval of 18 and 20 units, respectively.

Fig. 3
Fig.3Contact map after the integration of the Hi-C data and manual correction using PretextView.

Table 1 .
Summary statistics for genome assembly of Triticum timopheevii.

Table 3 .
Classification of repeat annotation in Triticum timopheevii.

Table 4 .
Summary statistics for the final structural annotation of the T. timopheevii genome.