High-quality wild barley genome assemblies and annotation with Nanopore long reads and Hi-C sequencing data

Wild barley, from “Evolution Canyon (EC)” in Mount Carmel, Israel, are ideal models for cereal chromosome evolution studies. Here, the wild barley EC_S1 is from the south slope with higher daily temperatures and drought, while EC_N1 is from the north slope with a cooler climate and higher relative humidity, which results in a differentiated selection due to contrasting environments. We assembled a 5.03 Gb genome with contig N50 of 3.53 Mb for wild barley EC_S1 and a 5.05 Gb genome with contig N50 of 3.45 Mb for EC_N1 using 145 Gb and 160.0 Gb Illumina sequencing data, 295.6 Gb and 285.35 Gb Nanopore sequencing data and 555.1 Gb and 514.5 Gb Hi-C sequencing data, respectively. BUSCOs and CEGMA evaluation suggested highly complete assemblies. Using full-length transcriptome data, we predicted 39,179 and 38,373 high-confidence genes in EC_S1 and EC_N1, in which 93.6% and 95.2% were functionally annotated, respectively. We annotated repetitive elements and non-coding RNAs. These two wild barley genome assemblies will provide a rich gene pool for domesticated barley.


Background & Summary
Barley (Hordeum vulgare L.), the fourth largest crop in terms of total cultivated area worldwide, is one of the earliest domesticated crops 1 .The cultivated barley is believed to be domesticated about 10,000 years ago from the wild progenitor H. spontaneum 2 .Beyond its importance as a major global crop, barley also serves as an invaluable model organism for research into crop domestication and adaptability due to its diploid status, relatively small genome within the Triticeae, and broad environmental adaptability 3,4 .A growing body of research highlights that during domestication, barley's agronomic traits were selectively enhanced for efficient harvesting, maximized yield, and improved grain quality.In contrast, genetic variations crucial for survival under environmental stresses have been diminished or even eradicated 5 , posing significant challenges when breeding new resilient varieties in response to climate and environmental shifts., Wild barley (Hordeum spontaneum K. Koch), the ancestor of cultivated barley, has a wide eco-geographic distribution across highly diverse environments throughout Southwestern Asia 6 .The capacity of wild barley to withstand dry and hot conditions has significant implications for barley breeding, especially considering that a mere 40% of alleles present in wild barley are found within the gene pool of globally cultivated barley 7,8 .The wild barley population, therefore, can contribute a rich reservoir of genes tolerant to drought and heat, which can be introduced into domesticated barley -a feat made possible by the ease with which the two species can crossbreed.This paves the way for breeding cultivars resilient to climate change.
High-quality genome assembly is required for the exploitation of beneficial genetic variants in the wild barley 1,8 .With the advancements in sequencing technology, notable strides have been made in barley genomics.The draft sequence assembly of barley cultivar (cv.)Morex was reported in 2012, and it was further improved in 2017, especially in the centromeric region and highly repetitive region 7,9 , and again with significant improvement in continuity in 2021 10 .Besides, the draft genome and high-quality reference genome of Tibetan hulless barley have been publicly available in 2018, which significantly enriched barley genomic resources 8,11 .Recently, a barley pan-genome study reported the de novo assemblies for 20 representative barley worldwide accessions and revealed abundant structural variations among the genomes 12 , which underscore the need for high-quality wild barley assemblies in comparative genomic studies and future barley breeding initiatives.
The 'Evolution Canyon' model serves as an optimal micro-climatic divergence model between slopes, designed to understand the impact of climate and environmental changes on genomic adaptation and differentiation 13 .The sharp microclimatic divergence between the abutting slopes has been proposed to drive genomic adaptive divergence underpinnings of local adaptation, providing a unique system for comparative genomic study.High-quality genome assemblies of wild barley from micro-climatically contrasting sites can enrich the barley genome resources and provide genomic insight into the relationship between environmental selection and genome evolution.Here, we report two chromosome-scale assemblies for two wild barleys (EC_S1 from the south slope, EC_N1 from the north slope of Evolution Canyon in Mountains of Carmel, Israel), using the Oxford Nanopore long-read sequencing method, Hi-C chromosome conformation capture and Bionano-optical mapping technologies.With BUSCO, CEGMAG and GC-depth analysis, we demonstrate that the two assemblies are of high integrity and accuracy.Using the assemblies, we further predicted their genes, repetitive elements, and non-coding RNAs.The wild barley can provide a rich gene pool for stress-tolerant genes that might be introduced into domesticated barley, and our wild barley genomes will greatly facilitate such endeavours.The wild barley assemblies will also enable comparative genomic studies penetrating genomic evolution and adaptation of barley.

Methods
Sample preparation, library construction and sequencing.Seeds were collected from two samples, EC-S1 and EC-N1, at the South-facing slope and north-facing slope, respectively, of the "Evolution Canyon" in Mount Carmel, Israel, and were germinated and grown in the glasshouse at Yangtze University (Jingzhou, Hubei Province, China).Mature leaves were harvested for DNA extraction and sequencing.Genomic DNA was extracted following the CTAB method and purified with QIAGEN ® Genomic kit (Cat#13343, QIAGEN, Germany).DNA quality and concentration were examined using a NanoDrop TM 8000 spectrophotometer (Thermo Fisher Scientific, USA).DNA concentration was estimated with a Qubit ® 4.0 Fluorometer (Thermo Fisher Scientific, USA).
For long-read sequencing, approximately 3-4 µg DNA per sample was used as input material for the ONT library preparations.After the sample was qualified, size-select of long DNA fragments was performed using the PippinHT system (Sage Science, USA).Next, the ends of DNA fragments were repaired, and A-ligation reactions were conducted with NEBNext Ultra II End Repair/dA-tailing Kit (Cat# E7546).The adapter in the SQK-LSK109 (Oxford Nanopore Technologies, UK) was used for further ligation reaction, and DNA library was measured by Qubit ® 4.0 Fluorometer (Thermo Fisher Scientific, USA).About 700 ng DNA library was con- structed and performed on a Nanopore PromethION sequencer instrument (Oxford Nanopore Technologies, UK) at the Genome Center of Grandomics (Wuhan, China).
A total of 295.6 Gb (~65× coverage of the estimated genome size) subreads in EC_S1 and 285.35Gb (~65× coverage of the estimated genome size) subreads in EC_N1 were yielded for genome assembly.For the Illumina NovaSeq.6000 platform, libraries for Illumina paired-end genome sequencing were constructed using Truseq Nano DNA HT Sample Preparation Kit (Illumina USA) following the standard manufacturer's protocol (Illumina), and then sequenced with a paired-end sequencing strategy.Finally, we obtained 145.0 Gb (~32× coverage of the estimated genome size) in EC_S1 and 160.0 Gb (~36X coverage of the estimated genome size) clean data after quality inspection.For High-through chromosome conformation capture (Hi-C) sequencing, genomic DNA was extracted from the EC_S1 and EC_N1 sample.Thereafter, we constructed the Hi-C library and obtained sequencing data via the Illumina Novaseq/MGI-2000 platform to anchor hybrid scaffolds onto chromosome 14 .After quality control and filtration, 555.1 Gb (~122× coverage of the estimated genome size) clean data in EC_S1 and 514.5 Gb clean data in EC_N1 were obtained for the next analysis.Samples of roots and leaves (and young panicle) at the seedling, tillering and booting stage were used to collect transcriptome data by RNA sequencing for predicting the gene model.
Total RNA was extracted by grinding tissue in TRIzol reagent (TIANGEN, China) on dry ice and processed following the protocol provided by the manufacturer.The integrity of the RNA was determined with the Agilent 2100 Bioanalyzer (Agilent Technologies) and agarose gel electrophoresis.The purity and concentration of the RNA were determined with the Nanodrop TM 8000 spectrophotometer (Thermo Fisher Scientific) and Qubit ® 4.0 Fluorometer (Thermo Fisher Scientific, USA).cDNAs were prepared with DNA damage repair, end repair, and sequencing adapters ligation using SMRTbell Template Prep Kit 1.0 (Pacific Biosciences).The SMRTbell template was annealed to the sequencing primer, bound to polymerase, and sequenced on the PacBio Sequel platform using Sequel Binding Kit 3.0 (Pacific Biosciences) with 20 h movies.Finally, a total of 168.9 Gb clean data in EC_S1 and 111.8 Gb clean data in EC_N1 with filtration was yielded for further analysis.
For BioNano physical mapping, DNA extracted from EC_S1 and EC_N1 were subject to manufacturer-recommended protocols for library preparation (Plant DNA Isolation Kit,80003) and optical scanning provided by BioNano Genomics (https://bionanogenomics.com), with the labeling enzyme Direct Label Enzyme (DLE) (Bionano PrepDLS Labeling DNA Kit,80005).Labelled DNA samples were loaded and run on the Saphyr system (BioNano Genomics).Raw BioNano data were cleaned by removing molecules matching any of the following rules: length less than 150 kb, molecule signal-to-noise ratio less than 2.75, label signal-to-noise ratio less than 2.75, or label intensity greater 0.8.About 443.19 Gb and 311.01 Gb clean data were yielded after filtering with the parameter "Molecule length <150 kb" and "MinSites (/100 kb) <9".

De novo assembly of the wild barley genome.
To ensure reads are reliable, Illumina paired-ended sequenced raw reads for the genomic survey were first filtered using the Fastp v.0.20.0 15 preprocessor (set to default parameters).To understand the genomic characteristics of EC_S1 and EC_N1, the K-mer analysis 16 was performed using Illumina DNA data prior to genome assembly to estimate the genome size and heterozygosity.Briefly, quality-filtered reads were subjected to 17-mer frequency distribution analysis using the Jellyfish program 16 .The genome size was determined based on k-mer frequency distributions, using details from the peak depth and the count of 17-mers.Likewise, the heterozygosity rate was estimated utilizing the count of k-mers at half the peak depth and through simulation analysis using A. thaliana genome data as described in a previous publication 17 .The results indicated that the estimated genome sizes of EC_S1 and EC_N1 were 4.56 Gb and 4.40 Gb, respectively, both displaying low heterozygosity.(Fig. 1).
For de novo genome assembly, an ONT-only assembly was constructed by using an OLC (overlap layout-consensus) 18 /string graph method 19 with NextDenovo.Considering the high error rate of ONT raw reads, the original subreads were first self-corrected using NextCorrect, thus obtaining 190.0 Gb (~38X coverage of the estimated genome size) and 172.8 Gb (~39× coverage of the estimated genome size) consistent sequences (CNS reads) in EC_S1 and EC_N1, respectively.Comparing CNS was then performed with the NextGraph module to capture correlations of CNS.Based on the correlation of CNS, 4.66 Gb preliminary genome with a contig N50 length of 3.26 Mb in EC_S1 and 4.66 Gb preliminary genome with a contig N50 length of 3.17 Mb in EC_N1 were obtained (Table 1).To improve the accuracy of the assembly, we refine the contigs with Racon 20 using ONT long reads and Nextpolish using Illumina short reads with default parameters.Finally, we obtained a polish genome of 5.03 Gb with a contig N50 length of 3.53 Mb in EC_S1 and 5.05 Gb with a contig N50 length of 3.45 Mb in EC_N1 (Table 2).
The completeness of genome assembly was assessed using BUSCO v4.0.5 with single copy homologous genes in embryophyta_odb10 of OrthoDB database (Benchmarking Universal Single Copy Orthologs) 21 and CEGMA v2 (Core Eukaryotic Gene Mapping Approach) 22 .96.2% and 96.3% of complete BUSCOs were found in EC_S1 Fig. 1 The k-mer distribution used to estimate the genome size of the wild barley EC_S1 and EC_N1.The distribution was determined based on the Jellyfish analysis using a k-mer size of 17. and EC_N1, respectively (Fig. S1).In addition, a total of 98.39% core genes in EC_S1 and 97.18% core genes in EC_N1 were detected among 248 core gene collections, suggesting high confidence in genome assembly in both EC_S1 and EC_N1 (Fig. S2).To evaluate the consistency of genome sequence, we aligned the second-generation sequencing data and the third-generation sequencing data to the polish genome by bwa v0.7.12-r1039 23 and minimap2 vr41 24 .The results showed that the average depth of the second-generation sequencing data in the EC_ S1 and EC_ N1 was 28.22 and 31.22,respectively, and the coverage (depth > = 1×) was 85.97 and 84.88%.The average depth of the third-generation sequencing data in EC_ S1 and EC_ N1 was 55.12 and 43.83, respectively, and the coverage (depth > = 1×) was 99.80 and 99.62% (Table 3).GC-depth analysis showed that the GC content was distributed in 40%-50%, and the sequencing depth was concentrated in 40-80× in both EC_ S1 and EC_ N1 assemblies (Fig. S3).The corrected genome sequence was compared with NT library (Nucleotide Sequence Database, downloaded on 3 rd August 2018, https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz) to determine the classification of the sequence, suggesting that there was a small amount of mitochondrial and chloroplast nucleic acids in the sequence but no exogenous pollution (Table 4).

Chromosome assembly by optical mapping and Hi-C data.
De novo assembly of BioNano molecules into genome maps was performed using the script pipelineCL.py in the BioNano Solve package v3.3 (BioNano Genomics).Hybrid scaffolds were assembled from ONT assembly and BioNano genome maps using the script hybridScaffold.pl in the Solve package.Finally, EC_S1's genome super-scaffold size was 5.1 Gb with a scaffold N50 of 90.4 Mb and contig N50 of 1.67 Mb; EC_N1's genome super-scaffold size was 5.2 Gb with scaffold N50 of 43.7 Mb and contig N50 of 1.59 Mb (Table 5).Compared to previously published barley genome assemblies, current genomes assemblies showed a great improvement in contig N50 and scaffold N50 (Table S1), and their quality was closed to the new version genome of Morex assembled by PacBio long-read (Table S2) 1,12 .For Hi-C auxiliary assembly, a total of 3.83 billion paired-end reads were generated from the libraries of EC_S1 and 3.54 billion from EC_N1.Then, quality controlling of Hi-C raw data was performed using HiC-Pro 25   research.Firstly, low-quality sequences (quality scores < 20), adaptor sequences and sequences shorter than 30 bp were filtered out using Fastp 15 .The clean paired-end reads were then mapped to the draft assembled sequence using bowtie2 v2.3.2 26 to get 759 million (39.63%) unique mapped paired-end reads in EC_S1 and 581 million (33.81%) in EC_N1.About 586 million (30.62%) valid interaction paired reads in EC_S1 and 436 million in EC_N1 were identified and retained by HiC-Pro 25 from unique mapped paired-end reads for further analysis (Table 6).Invalid read pairs, including dangling-end, self-cycle, re-ligation, and dumped products, were filtered by HiC-Pro 25 .The 5.07 Gb scaffolds (98.58%) in EC_S1 and 5.10 Gb scaffolds (97.16%) in EC_N1 were further clustered, ordered, and oriented scaffolds onto the seven chromosomes by LACHESIS 27 , respectively (Table 7).
According to the resulting Hi-C contact heatmap, mis-assemblies and mis-joins were manually corrected based on neighbouring interactions.The final assemblies were aligned to the previously reported barley genome assemblies of wild barley B1K-04-12 and cultivated barley Morex 12 by Mummer v4.0 28 .Then the raw alignments results were further filtered by delta-filter from Mummer software 28 .The results were visualized by NGenomeSyn v2.0 29 , which demonstrated high collinearity across the majority of all chromosome regions.Furthermore, we identified abundant structural variations (SVs), including large fragment inversions (INVs) and insertions or deletions (INDELs) (refer to Fig. 2 and Fig. S4).These findings enrich the diversity of the barley genome resource.

Gene model prediction and functional annotation.
We first annotated the tandem repeats using the software GMATA 30 and Tandem Repeats Finder (TRF) 31 , where GMATA identifies the simple repeats sequences (SSRs) and TRF recognizes all tandem repeat elements in the whole genome.Transposable elements (TE) in the EC_S1 and EC_N1 genomes were then identified using a combination of ab initio and homology-based methods.
For further identification of the repeats throughout the genome, RepeatMasker v2.0.1 32 was applied to search for known and novel TEs by mapping sequences against the de novo repeat library and Repbase TE library (version 20180826) 33 .Overlapping transposable elements belonging to the same repeat class were collated and combined.The repeat elements were annotated and shown in Table 8.Three independent approaches, including ab initio prediction, homology search, and reference guided transcriptome assembly, were used for gene prediction in a repeat-masked genome 34 .In detail, GeMoMa v1.3.1 35 was used to align the homologous protein sequences from related species to the assembly and then got the gene structure information, which was homolog prediction.For RNA-seq based gene prediction, filtered mRNA-seq reads were aligned to the reference genome using STAR (default) 36 .The transcripts were then assembled using Stringtie v2.1.4 37and open reading frames (ORFs) were predicted using Program to Assemble Spliced Alignments (PASA) 38 .For the de novo prediction, RNA-seq reads were de novo assembled using Stringtie and analyzed with PASA to produce a training set.Augustus v2.5.5 39 with default parameters was then utilized for ab initio gene prediction with the training set.Finally, EVidenceModeler (EVM) 40 was used to produce an integrated gene set of which genes with TE were removed using Transposon PSI package 41 and the miscoded genes were further filtered.According to Mascher et al. 7 , high-confidence (HC) gene was defined as genes that had a significant BLAST hit to reference proteins and representative proteins had a similarity to the respective template sequence above a threshold which was determined on the basis of the origin of template sequences (>60% for Arabidopsis thaliana, sorghum and rice, >65% for Brachypodium distachyon, and >85% for barley).Finally, a total of 39,179 high-confidence and 20,936 low-confidence protein-coding genes were identified in EC_S1 genome, and 38,373 high-confidence and 20,243 low-confidence protein-coding genes in EC_N1 (Table 9).

Samples
Gene functional information, motifs and domains of their proteins were assigned by comparing with public databases including SwissProt 42 , NCBI non-redundant protein sequences (nr) 43 , Kyoto Encyclopedia of Genes and Genomes (KEGG) 44 , Clusters of orthologous groups for eukaryotic complete genomes (KOG) 45 and Gene Ontology (GO) 46 .The putative domains and GO terms of genes were identified using the InterProScan program 47 with default parameters.For the other four databases, BLASTp 48 was used to compare the EvidenceModeler-integrated protein sequences against the four well-known public protein databases with an E-value cutoff of 1e-05 and the results with the hit with the lowest E value were retained.Results from the five database searches were concatenated, leading to a total of 56,261 (93.59%) genes in EC_S1 and 55,772 (95.15) genes in EC_N1 with function annotation (Table 9).annotation of non-coding RNa genes.To obtain the ncRNA (non-coding RNA), we used two strategies: searching against a database and predicting with a model.Transfer RNAs (tRNAs) were predicted using tRNAscan-SE v2.0.6 49 with eukaryote parameters.MicroRNA, rRNA, small nuclear RNA, and small nucleolar RNA were detected using Infernal cmscan 50 to search the Rfam database 51 .The rRNAs and their subunits were LG01  Fig. 2 The collinearity analysis among assemblies of EC_S1, EC_N1, B1K-04-12 and Morex.

Data Records
The EC_S1 and EC_N1 genome sequence are available at NCBI database under Bioproject accession PRJNA947680 53,54 .RNA-seq (The samples' information are showed in Table S3), NGS, Hi-C, and Nanopore data sets are available at NCBI under Bioproject accession PRJNA748178 55 .Bionano data sets are available at NCBI Supplementary Files under accession SUPPF_0000004010 (EC_S1) and SUPPF_0000004011 (EC_N1) 55 .The genome annotation GFF3, CDS sequences, and protein sequences are available at figshare 56 .

technical Validation
DNa and RNa integrity.The quality of DNA and RNA molecules and libraries was examined before genome and transcriptome sequencing.The DNA degradation and contamination of the extracted DNA were monitored on 1% agarose gels.DNA purity was then inspected using NanoDrop ™ 8000 spectrophotometer (Thermo Fisher Scientific, USA), of which OD260/280 ranged from 1.8 to 2.0 and OD 260/230 was between 2.0 to 2.2.Finally, DNA concentration was further measured by Qubit ® 4.0 Fluorometer (Thermo Fisher Scientific, USA).The integrity of the RNA was determined with the Agilent 2100 Bioanalyzer (Agilent Technologies) and agarose gel electrophoresis.The purity and concentration of the RNA were determined with the Nanodrop TM 8000 spectrophotometer (Thermo Fisher Scientific, USA) and Qubit ® 4.0 Fluorometer (Thermo Fisher Scientific, USA).Only the high-quality RNA sample (OD260/280 = 1.8~2.2,OD260/230 ≥ 2.0, RIN ≥ 7, >1 μg) was used to construct the sequencing library.
assessment of the genome assembly.After using BUSCO and CEGMA to evaluate genome integrity, we have also evaluated the accuracy of the genome.All the Illumina paired-end reads were mapped to the assembled genome using bwa 0.7.12-r1039 (default) 22 , and the mapping rate, as well as genome coverage of sequencing reads were assessed.Then samtools v1.4 57 and bcftools v2.29.2 58 were used to calculate the homozygous and heterozygous mutation sites corresponding to the samples.Homozygous sites were regarded as genomic error sites to calculate the single base error rate.The accuracy of genomic single base was 99.997% (depth > = 5x) in EC_S1 and 99.996% (depth > = 5x) in EC_N1.The Minimap2 r41 (-x map-ont) 23 was used to map all long-reads back  to the genome, to calculate mapping rate, coverage, and GC content.the draft genome assemblies were submitted to the NT library (Nucleotide Sequence Database, downloaded on 3 rd August, 2018, https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz) and aligned sequences were eliminated to remove the mitochondria sequences in the assemblies.The results showed that most of the sequences were aligned with the target species, indicating that there was no external contamination in the assembled genome.Finally, the seven chromosomes of EC_S1 and EC_N1 assemblies were evaluated.The genome with chromosomes aligned by Hi-C data was divided into 'bin' (in a length of 100 KB).The number of Hi-C read pairs covered by any two 'bins' was used to define the signal for the interaction between those 'bins' 27 , and the heat map of Hi-C interaction of chromosomes was made by HiCPlotter.py script in Python v2.7 (Fig. 3).This figure shows that the intensity of interaction in the diagonal position was higher than that in the non-diagonal position, and there was no obvious noise outside the diagonal, indicating that the chromosomes assembly of both EC_S1 and EC_N1 were high-quality.

Table 1 .
Statistics of EC_S1 and EC_N1 preliminary genome assembly.
as in previous

Table 2 .
Statistics of the EC_S1 and EC_N1 polished genome assembly.

Table 3 .
Genome sequence consistency and coverage.

Table 4 .
Genomic contamination assessment of EC_S1 and EC_N1 assemblies.

Table 5 .
Statistics of scaffold constructed by BioNano in EC_S1 and EC_N1.

Table 6 .
Valid paired end reads statistics of Hi-C data.

Table 7 .
Chromosome length assembled by Hi-C data in EC_S1 and EC_N1.

Table 8 .
Characterization of wild barley TE annotation in wild barley EC_S1 and EC_N1.

Table 9 .
The summary of gene annotation in the EC_S1 and EC_N1 assemblies.
addition, 1913 and 2039 tRNA were detected in EC_S1 and EC_N1, covering all 20 anti-codons types of amino acids (Table

Table 10 .
Summary of non-coding RNA in the EC_S1 and EC_N1 assemblies.