Chromosome-scale assembly of the wild wheat relative Aegilops umbellulata

Wild wheat relatives have been explored in plant breeding to increase the genetic diversity of bread wheat, one of the most important food crops. Aegilops umbellulata is a diploid U genome-containing grass species that serves as a genetic reservoir for wheat improvement. In this study, we report the construction of a chromosome-scale reference assembly of Ae. umbellulata accession TA1851 based on corrected PacBio HiFi reads and chromosome conformation capture. The total assembly size was 4.25 Gb with a contig N50 of 17.7 Mb. In total, 36,268 gene models were predicted. We benchmarked the performance of hifiasm and LJA, two of the most widely used assemblers using standard and corrected HiFi reads, revealing a positive effect of corrected input reads. Comparative genome analysis confirmed substantial chromosome rearrangements in Ae. umbellulata compared to bread wheat. In summary, the Ae. umbellulata assembly provides a resource for comparative genomics in Triticeae and for the discovery of agriculturally important genes.

Aegilops umbellulata (2n = 2x = 14, UU genome) is the only diploid Aegilops species containing the U genome (Fig. 1a).Compared to the bread wheat A, B and D genomes, the U genome contains several large chromosome rearrangements.In particular, chromosomes 4U, 6U,and 7U show multiple reciprocal translocations, inversions and intra-chromosomal translocations 7,8 .The U genome is a source of disease resistance genes that have been transferred into wheat, including Lr9, Lr76, Yr70 and PmY39 [9][10][11] .Recently, the leaf rust resistance gene Lr9 has been cloned and found to encode an unusual kinase fusion protein.Ae. umbellulata accession TA1851 was identified as the probable donor of Lr9 12 .In this previous analysis, a contig-level assembly of TA1851 was generated to evaluate the Lr9 translocation in bread wheat.The TA1851 contig-level assembly was based on ~157 Gb (~35-fold coverage) of HiFi reads 13 .
In this current study, we first polished the TA1851 HiFi reads using the DeepConsensus 14 pipeline in order to increase read accuracy and to improve the primary contig-level assembly.We then assembled an Ae.umbellulata chromosome-scale reference genome by integrating chromatin conformation capture (Omni-C) data.CpG methylation along the chromosomes was inferred from the PacBio CCS data.The high-quality Ae. umbellulata assembly obtained in this study provides a reference for the U genome of the Triticeae tribe.It will serve as the basis to study chromosome rearrangements across different Triticeae species and can be explored to detect U genome introgressions in durum and bread wheat.

Plant material, DNA extraction and sequencing.
The DNA extraction and generation of PacBio HiFi reads was described previously 12 .In brief, high molecular weight (HMW) DNA was extracted from young seedlings of Ae. umbellulata accession TA1851 using a modified Qiagen Genomic DNA extraction protocol (https:// doi.org/10.17504/protocols.io.bafmibk6) 15 .DNA was sheared to the appropriate size range (15-20 kb) using Megaruptor 3 (Diagenode) for the construction of PacBio HiFi sequencing libraries.Library preparation was done with the Express Template Prep Kit 2.0 (100-938-900 + Enzyme Clean up 2.0 (101-932-600)), and size was selected with a PippinHT System (Sage Science, HTP0001).Sequencing was performed on PacBio Sequel II systems.The Omni-C library was prepared and sequenced at Cantata Bio using the Dovetail ® Omni-C ® Kit for plant tissues according to the manufacturer's protocol.One library was sequenced on an Illumina MiSeq platform to generate ~776 million 2 × 150 bp read pairs for Ae.umbellulata accession TA1851.

Contig-level assembly benchmarking.
We first compared contig-level assemblies generated by hifiasm 16 and the La Jolla Assembler (LJA) 17 using standard HiFi reads and corrected HiFi reads generated with DeepConsensus 14 .The raw subreads from five SMRT cells were processed using the ccs software (https://github.com/PacificBiosciences/ccs) or DeepConsensus (Table 1).The correction with DeepConsensus produced fewer HiFi data (~157 Gb and ~150 Gb for ccs and DeepConsensus, respectively), but resulted in an increase of the mean read QV (29.9 and 33.1 for ccs and DeepConsensus, respectively) (Table 1).
Contig-level assemblies generated with the different assemblers and data sets were assessed using the basic summary statistics (Table 2).All four assemblies had similar total assembly sizes.For hifiasm, we observed marked increases of contig N50 (11.1 Mb to 14 Mb; + 26%) and contig N90 (3.2 Mb to 3.8 Mb; + 20%) when using corrected HiFi reads (Table 2).Overall, LJA outperformed hifiasm in terms of contiguity.In comparison to hifiasm, DeepConsensus did not result in a considerable increase of contig N50 with LJA, while the contig  showing a 59% and 63% increase in contig N50 and contig N90, respectively, compared to the hifiasm assembly with standard HiFi reads (Table 2).In terms of computational resources, all the contig-level assemblies were performed on a single AMD node using 120 cores.We observed that the memory usage was higher with LJA with an increase of 61% and 20% with the standard and corrected HiFi reads, respectively.The computing time was also considerably higher with LJA (Table 2).Based on the overall performance, the LJA-DeepConsensus contig-level assembly was used to construct a chromosome-scale Ae. umbellulata assembly.

Chromosome-scale assembly. Construction of the pseudomolecules was performed by integrating
Omni-C read data using Juicer (v2; https://github.com/aidenlab/juicer) 18and the 3D-DNA pipeline (https:// github.com/aidenlab/3d-dna) 19.First, to generate the contact maps, Omni-C Illumina short reads were preprocessed with juicer.sh(parameters: -s none-assembly).The output file "merged_nodups.txt"and the primary assembly were then used to produce an assembly with 3D-DNA 19 (using run-asm-pipeline.shwith -r 0 parameter).We used Juicebox (v2.14.00) 20 to visualize the Hi-C contact matrix along the assembly, and to manually curate the assembly.The orientation and the chromosome number of each pseudomolecule were determined based on an existing assembly of Ae. tauschii 21 , a close relative of Ae. umbellulata, using a dotplot comparison produced with chromeister (https://github.com/estebanpw/chromeister) 22.There has been some inconsistency in naming the highly rearranged chromosomes 4U and 6U.We decided to follow the most common nomenclature used in the recent publication of Said, et al. 8 .Contigs not anchored in the pseudomolecules were concatenated into an "unanchored chromosome".The final Hi-C contact maps and assemblies were saved using run-asm-pipeline-post-review.sh from the 3D-DNA pipeline.The genome assembly resulted in seven pseudomolecules and one unanchored chromosome (Fig. 1b; Table 3).
All the output gff files from the lifting and genome-guided approaches were merged into a single file using the perl script "agat_sp_merge_annotations.pl".The merged file was then post-processed using gffread tools (parameters:-keep-genes -N -J) to retain transcripts with start and stop codons, and to discard transcripts with 1) premature stop codons and/or 2) having introns with non-canonical splice sites.In total, 36,268 gene models were predicted for which the putative functional annotations were assigned using a protein comparison against the UniProt database (2021_03) using DIAMOND 38 (parameter: -f 6 -k 1 -e 1e-6).PFAM domain signatures and GO were assigned using InterproScan version 5.55-88.0 39.
The synteny analysis against Ae.tauschii was computed using MCScanX 40 with defaults parameters, which allowed us to identify the main translocation events within the Ae.umbellulata genome (Fig. 1b).
PacBio DNA methylation profile.Methylation in CpG context was inferred with ccsmeth (v0.3.2) 41 , a deep-learning method to detect DNA 5mCpGs by using kinetics features from PacBio CCS reads.The methylation prediction for CCS reads were called using the model "model_ccsmeth_5mCpG_call_mods_attbigru2s_b21.v1.ckpt".Then, the reads with the MM + ML tags were aligned to the pseudomolecules using BWA (v0.7.17) 42 and the subsequent BAM file was filtered for hard/soft clips and quality (MAPQ ≥ 60) using SAMtools (v1.8) 43 .The methylation frequency was calculated at genome level with the modbam files and the aggregate mode of ccsmeth with the model "model_ccsmeth_5mCpG_aggregate_attbigru_b11.v2.ckpt".

Genome visualization.
The genome of Ae. umbellulata accession TA1851 was uploaded into the Persephone ® multi-genome browser (https://web.persephonesoft.com/?data=genomes/TA1851).The data tracks available are the DNA sequence, gene model prediction, and the CpG methylation.A BLAST 36 search and synteny analysis with the hexaploid wheat line Chinese Spring (v.2.1) 44 are also available (Fig. 2).

Data records
The corrected HiFi reads and the raw Omni-C reads were deposited in the Sequence Read Archive at NCBI under accession number ERP147844 45 .The final chromosome assembly was deposited at NCBI under the accession number GCA_032464435.1 46 .

Technical Validation
Assessment of genome assembly and annotation.The Hi-C contact map was manually curated and assessed with Juicebox and showed a dense pattern along the diagonal revealing no potential mis-assemblies (Fig. 3).The anti-diagonals are typical for Triticeae genomes and correspond the Rabl configuration of Triticeae chromosomes 48,49 .Chromosome 6U does not show the anti-diagonal, which is most likely due to the extreme acrocentric nature of this chromosome 50,51 (Fig. 3).
The BUSCO 52 (v5.4.5 -poales_odb10) score of 98% (0.4% fragmented and 1.6% missing BUSCOs) at the genome level indicates a high completeness of the TA1851 assembly.The quality of the Ae.umbellulata assembly was assessed with Merqury 53 based on the PacBio HiFi reads using 19-mers.The QV (consensus quality value) and k-mer completeness scores were 59.3 and 98.1%, respectively.We further determined the LTR Assembly Index (LAI) and obtained a value of 16.42, which corresponds to a reference quality genome 54 .Telomeric repeats (TTTAGGG) n 55,56 were found at the extremities of all the pseudomolecules, except the short arms of chromosomes 1U and 5U,which corresponds to the location of the rDNA loci in Ae. umbellulata 57 .
Completeness of the gene model prediction was evaluated using BUSCO and produced a score of 98.1% (0.3% fragmented and 1.6% missing BUSCOs).The number of predicted gene models (36,268) is in the range of a diploid Triticeae species (34,000-43,000 high-confidence gene models per haploid genome) 58 .

Fig. 2
Fig. 2 Genome visualization with Persephone.(a) Persephone genome browser visualization.The upper panel represents the position along chromosome 3U.The middle panel shows an example of three gene models with their predicted isoforms.In the lower panel, the CpG methylation profile is represented in blue and red for the unmethylated and methylated bases, respectively.(b) Synteny matrix between the seven Ae. umbellulata chromosomes (x-axis) and the 21 chromosomes of the bread wheat line Chinese spring v2.1 (y-axis) (c) Synteny comparison of the highly rearranged Ae. umbellulata chromosome 6U (in central position) in comparison to bread wheat chromosomes 1D, 2D, 4D, 6D and 7D.The links between chromosomes represented orthologous gene relationships.

Fig. 3
Fig. 3 Contact map after the integration of the Omni-C data and manual correction.Green and blue boxes represent contigs and pseudomolecules, respectively.

Table 1 .
Comparison of read quality and yield per SMRT cell between ccs and DeepConsensus pipeline for the generation of HiFi reads.N90 increased by 16% (4.5 Mb to 5.2 Mb).The highest contiguity was observed with LJA and DeepConsensus,

Table 4 .
Classification of repeat annotation in Aegilops umbellulata.