High-quality wild barley genome assemblies and annotation with Nanopore long reads and Hi-C sequencing data

Pan, Rui; Hu, Haifei; Xiao, Yuhui; Xu, Le; Xu, Yanhao; Ouyang, Kai; Li, Chengdao; He, Tianhua; Zhang, Wenying

doi:10.1038/s41597-023-02434-2

Download PDF

Data Descriptor
Open access
Published: 10 August 2023

High-quality wild barley genome assemblies and annotation with Nanopore long reads and Hi-C sequencing data

Rui Pan¹,
Haifei Hu ORCID: orcid.org/0000-0003-1070-213X^2,3,
Yuhui Xiao⁴,
Le Xu^1,5,
Yanhao Xu^1,5,
Kai Ouyang⁴,
Chengdao Li ORCID: orcid.org/0000-0002-9653-2700^2,6,
Tianhua He² &
…
Wenying Zhang ORCID: orcid.org/0000-0002-9926-4115^1,7

Scientific Data volume 10, Article number: 535 (2023) Cite this article

2201 Accesses
1 Citations
7 Altmetric
Metrics details

Subjects

Abstract

Wild barley, from “Evolution Canyon (EC)” in Mount Carmel, Israel, are ideal models for cereal chromosome evolution studies. Here, the wild barley EC_S1 is from the south slope with higher daily temperatures and drought, while EC_N1 is from the north slope with a cooler climate and higher relative humidity, which results in a differentiated selection due to contrasting environments. We assembled a 5.03 Gb genome with contig N50 of 3.53 Mb for wild barley EC_S1 and a 5.05 Gb genome with contig N50 of 3.45 Mb for EC_N1 using 145 Gb and 160.0 Gb Illumina sequencing data, 295.6 Gb and 285.35 Gb Nanopore sequencing data and 555.1 Gb and 514.5 Gb Hi-C sequencing data, respectively. BUSCOs and CEGMA evaluation suggested highly complete assemblies. Using full-length transcriptome data, we predicted 39,179 and 38,373 high-confidence genes in EC_S1 and EC_N1, in which 93.6% and 95.2% were functionally annotated, respectively. We annotated repetitive elements and non-coding RNAs. These two wild barley genome assemblies will provide a rich gene pool for domesticated barley.

The genome and population genomics of allopolyploid Coffea arabica reveal the diversification history of modern coffee cultivars

Article Open access 15 April 2024

Genome assembly in the telomere-to-telomere era

Article 22 April 2024

A pan-genome of 69 Arabidopsis thaliana accessions reveals a conserved genome structure throughout the global species range

Article Open access 11 April 2024

Background & Summary

Barley (Hordeum vulgare L.), the fourth largest crop in terms of total cultivated area worldwide, is one of the earliest domesticated crops¹. The cultivated barley is believed to be domesticated about 10,000 years ago from the wild progenitor H. spontaneum². Beyond its importance as a major global crop, barley also serves as an invaluable model organism for research into crop domestication and adaptability due to its diploid status, relatively small genome within the Triticeae, and broad environmental adaptability^3,4. A growing body of research highlights that during domestication, barley’s agronomic traits were selectively enhanced for efficient harvesting, maximized yield, and improved grain quality. In contrast, genetic variations crucial for survival under environmental stresses have been diminished or even eradicated⁵, posing significant challenges when breeding new resilient varieties in response to climate and environmental shifts., Wild barley (Hordeum spontaneum K. Koch), the ancestor of cultivated barley, has a wide eco-geographic distribution across highly diverse environments throughout Southwestern Asia⁶. The capacity of wild barley to withstand dry and hot conditions has significant implications for barley breeding, especially considering that a mere 40% of alleles present in wild barley are found within the gene pool of globally cultivated barley^7,8. The wild barley population, therefore, can contribute a rich reservoir of genes tolerant to drought and heat, which can be introduced into domesticated barley - a feat made possible by the ease with which the two species can crossbreed. This paves the way for breeding cultivars resilient to climate change.

High-quality genome assembly is required for the exploitation of beneficial genetic variants in the wild barley^1,8. With the advancements in sequencing technology, notable strides have been made in barley genomics. The draft sequence assembly of barley cultivar (cv.) Morex was reported in 2012, and it was further improved in 2017, especially in the centromeric region and highly repetitive region^7,9, and again with significant improvement in continuity in 2021¹⁰. Besides, the draft genome and high-quality reference genome of Tibetan hulless barley have been publicly available in 2018, which significantly enriched barley genomic resources^8,11. Recently, a barley pan-genome study reported the de novo assemblies for 20 representative barley worldwide accessions and revealed abundant structural variations among the genomes¹², which underscore the need for high-quality wild barley assemblies in comparative genomic studies and future barley breeding initiatives.

The ‘Evolution Canyon’ model serves as an optimal micro-climatic divergence model between slopes, designed to understand the impact of climate and environmental changes on genomic adaptation and differentiation¹³. The sharp microclimatic divergence between the abutting slopes has been proposed to drive genomic adaptive divergence underpinnings of local adaptation, providing a unique system for comparative genomic study. High-quality genome assemblies of wild barley from micro-climatically contrasting sites can enrich the barley genome resources and provide genomic insight into the relationship between environmental selection and genome evolution. Here, we report two chromosome-scale assemblies for two wild barleys (EC_S1 from the south slope, EC_N1 from the north slope of Evolution Canyon in Mountains of Carmel, Israel), using the Oxford Nanopore long-read sequencing method, Hi-C chromosome conformation capture and Bionano-optical mapping technologies. With BUSCO, CEGMAG and GC-depth analysis, we demonstrate that the two assemblies are of high integrity and accuracy. Using the assemblies, we further predicted their genes, repetitive elements, and non-coding RNAs. The wild barley can provide a rich gene pool for stress-tolerant genes that might be introduced into domesticated barley, and our wild barley genomes will greatly facilitate such endeavours. The wild barley assemblies will also enable comparative genomic studies penetrating genomic evolution and adaptation of barley.

Methods

Sample preparation, library construction and sequencing

Seeds were collected from two samples, EC-S1 and EC-N1, at the South-facing slope and north-facing slope, respectively, of the “Evolution Canyon” in Mount Carmel, Israel, and were germinated and grown in the glasshouse at Yangtze University (Jingzhou, Hubei Province, China). Mature leaves were harvested for DNA extraction and sequencing. Genomic DNA was extracted following the CTAB method and purified with QIAGEN® Genomic kit (Cat#13343, QIAGEN, Germany). DNA quality and concentration were examined using a NanoDrop^TM 8000 spectrophotometer (Thermo Fisher Scientific, USA). DNA concentration was estimated with a Qubit® 4.0 Fluorometer (Thermo Fisher Scientific, USA).

For long-read sequencing, approximately 3–4 µg DNA per sample was used as input material for the ONT library preparations. After the sample was qualified, size-select of long DNA fragments was performed using the PippinHT system (Sage Science, USA). Next, the ends of DNA fragments were repaired, and A-ligation reactions were conducted with NEBNext Ultra II End Repair/dA-tailing Kit (Cat# E7546). The adapter in the SQK-LSK109 (Oxford Nanopore Technologies, UK) was used for further ligation reaction, and DNA library was measured by Qubit® 4.0 Fluorometer (Thermo Fisher Scientific, USA). About 700 ng DNA library was constructed and performed on a Nanopore PromethION sequencer instrument (Oxford Nanopore Technologies, UK) at the Genome Center of Grandomics (Wuhan, China).

A total of 295.6 Gb (~65× coverage of the estimated genome size) subreads in EC_S1 and 285.35 Gb (~65× coverage of the estimated genome size) subreads in EC_N1 were yielded for genome assembly. For the Illumina NovaSeq. 6000 platform, libraries for Illumina paired-end genome sequencing were constructed using Truseq Nano DNA HT Sample Preparation Kit (Illumina USA) following the standard manufacturer’s protocol (Illumina), and then sequenced with a paired-end sequencing strategy. Finally, we obtained 145.0 Gb (~32× coverage of the estimated genome size) in EC_S1 and 160.0 Gb (~36X coverage of the estimated genome size) clean data after quality inspection. For High-through chromosome conformation capture (Hi-C) sequencing, genomic DNA was extracted from the EC_S1 and EC_N1 sample. Thereafter, we constructed the Hi-C library and obtained sequencing data via the Illumina Novaseq/MGI-2000 platform to anchor hybrid scaffolds onto chromosome¹⁴. After quality control and filtration, 555.1 Gb (~122× coverage of the estimated genome size) clean data in EC_S1 and 514.5 Gb clean data in EC_N1 were obtained for the next analysis. Samples of roots and leaves (and young panicle) at the seedling, tillering and booting stage were used to collect transcriptome data by RNA sequencing for predicting the gene model.

Total RNA was extracted by grinding tissue in TRIzol reagent (TIANGEN, China) on dry ice and processed following the protocol provided by the manufacturer. The integrity of the RNA was determined with the Agilent 2100 Bioanalyzer (Agilent Technologies) and agarose gel electrophoresis. The purity and concentration of the RNA were determined with the Nanodrop^TM 8000 spectrophotometer (Thermo Fisher Scientific) and Qubit® 4.0 Fluorometer (Thermo Fisher Scientific, USA). cDNAs were prepared with DNA damage repair, end repair, and sequencing adapters ligation using SMRTbell Template Prep Kit 1.0 (Pacific Biosciences). The SMRTbell template was annealed to the sequencing primer, bound to polymerase, and sequenced on the PacBio Sequel platform using Sequel Binding Kit 3.0 (Pacific Biosciences) with 20 h movies. Finally, a total of 168.9 Gb clean data in EC_S1 and 111.8 Gb clean data in EC_N1 with filtration was yielded for further analysis.

For BioNano physical mapping, DNA extracted from EC_S1 and EC_N1 were subject to manufacturer-recommended protocols for library preparation (Plant DNA Isolation Kit,80003) and optical scanning provided by BioNano Genomics (https://bionanogenomics.com), with the labeling enzyme Direct Label Enzyme (DLE) (Bionano PrepDLS Labeling DNA Kit,80005). Labelled DNA samples were loaded and run on the Saphyr system (BioNano Genomics). Raw BioNano data were cleaned by removing molecules matching any of the following rules: length less than 150 kb, molecule signal-to-noise ratio less than 2.75, label signal-to-noise ratio less than 2.75, or label intensity greater 0.8. About 443.19 Gb and 311.01 Gb clean data were yielded after filtering with the parameter “Molecule length <150 kb” and “MinSites (/100 kb) <9”.

De novo assembly of the wild barley genome

To ensure reads are reliable, Illumina paired-ended sequenced raw reads for the genomic survey were first filtered using the Fastp v.0.20.0¹⁵ preprocessor (set to default parameters). To understand the genomic characteristics of EC_S1 and EC_N1, the K-mer analysis¹⁶ was performed using Illumina DNA data prior to genome assembly to estimate the genome size and heterozygosity. Briefly, quality-filtered reads were subjected to 17-mer frequency distribution analysis using the Jellyfish program¹⁶. The genome size was determined based on k-mer frequency distributions, using details from the peak depth and the count of 17-mers. Likewise, the heterozygosity rate was estimated utilizing the count of k-mers at half the peak depth and through simulation analysis using A. thaliana genome data as described in a previous publication¹⁷. The results indicated that the estimated genome sizes of EC_S1 and EC_N1 were 4.56 Gb and 4.40 Gb, respectively, both displaying low heterozygosity. (Fig. 1).

For de novo genome assembly, an ONT-only assembly was constructed by using an OLC (overlap layout-consensus)¹⁸/string graph method¹⁹ with NextDenovo. Considering the high error rate of ONT raw reads, the original subreads were first self-corrected using NextCorrect, thus obtaining 190.0 Gb (~38X coverage of the estimated genome size) and 172.8 Gb (~39× coverage of the estimated genome size) consistent sequences (CNS reads) in EC_S1 and EC_N1, respectively. Comparing CNS was then performed with the NextGraph module to capture correlations of CNS. Based on the correlation of CNS, 4.66 Gb preliminary genome with a contig N50 length of 3.26 Mb in EC_S1 and 4.66 Gb preliminary genome with a contig N50 length of 3.17 Mb in EC_N1 were obtained (Table 1). To improve the accuracy of the assembly, we refine the contigs with Racon²⁰ using ONT long reads and Nextpolish using Illumina short reads with default parameters. Finally, we obtained a polish genome of 5.03 Gb with a contig N50 length of 3.53 Mb in EC_S1 and 5.05 Gb with a contig N50 length of 3.45 Mb in EC_N1 (Table 2).

Table 1 Statistics of EC_S1 and EC_N1 preliminary genome assembly.

Full size table

Table 2 Statistics of the EC_S1 and EC_N1 polished genome assembly.

Full size table

The completeness of genome assembly was assessed using BUSCO v4.0.5 with single copy homologous genes in embryophyta_odb10 of OrthoDB database (Benchmarking Universal Single Copy Orthologs)²¹ and CEGMA v2 (Core Eukaryotic Gene Mapping Approach)²². 96.2% and 96.3% of complete BUSCOs were found in EC_S1 and EC_N1, respectively (Fig. S1). In addition, a total of 98.39% core genes in EC_S1 and 97.18% core genes in EC_N1 were detected among 248 core gene collections, suggesting high confidence in genome assembly in both EC_S1 and EC_N1 (Fig. S2). To evaluate the consistency of genome sequence, we aligned the second-generation sequencing data and the third-generation sequencing data to the polish genome by bwa v0.7.12-r1039²³ and minimap2 vr41²⁴. The results showed that the average depth of the second-generation sequencing data in the EC_ S1 and EC_ N1 was 28.22 and 31.22, respectively, and the coverage (depth > = 1×) was 85.97 and 84.88%. The average depth of the third-generation sequencing data in EC_ S1 and EC_ N1 was 55.12 and 43.83, respectively, and the coverage (depth > = 1×) was 99.80 and 99.62% (Table 3). GC-depth analysis showed that the GC content was distributed in 40%–50%, and the sequencing depth was concentrated in 40–80× in both EC_ S1 and EC_ N1 assemblies (Fig. S3). The corrected genome sequence was compared with NT library (Nucleotide Sequence Database, downloaded on 3^rd August 2018, https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz) to determine the classification of the sequence, suggesting that there was a small amount of mitochondrial and chloroplast nucleic acids in the sequence but no exogenous pollution (Table 4).

Table 3 Genome sequence consistency and coverage.

Full size table

Table 4 Genomic contamination assessment of EC_S1 and EC_N1 assemblies.

Full size table

Chromosome assembly by optical mapping and Hi-C data

De novo assembly of BioNano molecules into genome maps was performed using the script pipelineCL.py in the BioNano Solve package v3.3 (BioNano Genomics). Hybrid scaffolds were assembled from ONT assembly and BioNano genome maps using the script hybridScaffold.pl in the Solve package. Finally, EC_S1’s genome super-scaffold size was 5.1 Gb with a scaffold N50 of 90.4 Mb and contig N50 of 1.67 Mb; EC_N1’s genome super-scaffold size was 5.2 Gb with scaffold N50 of 43.7 Mb and contig N50 of 1.59 Mb (Table 5). Compared to previously published barley genome assemblies, current genomes assemblies showed a great improvement in contig N50 and scaffold N50 (Table S1), and their quality was closed to the new version genome of Morex assembled by PacBio long-read (Table S2)^1,12. For Hi-C auxiliary assembly, a total of 3.83 billion paired-end reads were generated from the libraries of EC_S1 and 3.54 billion from EC_N1. Then, quality controlling of Hi-C raw data was performed using HiC-Pro²⁵ as in previous research. Firstly, low-quality sequences (quality scores < 20), adaptor sequences and sequences shorter than 30 bp were filtered out using Fastp¹⁵. The clean paired-end reads were then mapped to the draft assembled sequence using bowtie2 v2.3.2²⁶ to get 759 million (39.63%) unique mapped paired-end reads in EC_S1 and 581 million (33.81%) in EC_N1. About 586 million (30.62%) valid interaction paired reads in EC_S1 and 436 million in EC_N1 were identified and retained by HiC-Pro²⁵ from unique mapped paired-end reads for further analysis (Table 6). Invalid read pairs, including dangling-end, self-cycle, re-ligation, and dumped products, were filtered by HiC-Pro²⁵. The 5.07 Gb scaffolds (98.58%) in EC_S1 and 5.10 Gb scaffolds (97.16%) in EC_N1 were further clustered, ordered, and oriented scaffolds onto the seven chromosomes by LACHESIS²⁷, respectively (Table 7). According to the resulting Hi-C contact heatmap, mis-assemblies and mis-joins were manually corrected based on neighbouring interactions. The final assemblies were aligned to the previously reported barley genome assemblies of wild barley B1K-04–12 and cultivated barley Morex¹² by Mummer v4.0²⁸. Then the raw alignments results were further filtered by delta-filter from Mummer software²⁸. The results were visualized by NGenomeSyn v2.0²⁹, which demonstrated high collinearity across the majority of all chromosome regions. Furthermore, we identified abundant structural variations (SVs), including large fragment inversions (INVs) and insertions or deletions (INDELs) (refer to Fig. 2 and Fig. S4). These findings enrich the diversity of the barley genome resource.

Table 5 Statistics of scaffold constructed by BioNano in EC_S1 and EC_N1.

Full size table

Table 6 Valid paired end reads statistics of Hi-C data.

Full size table

Table 7 Chromosome length assembled by Hi-C data in EC_S1 and EC_N1.

Full size table

Gene model prediction and functional annotation

We first annotated the tandem repeats using the software GMATA³⁰ and Tandem Repeats Finder (TRF)³¹, where GMATA identifies the simple repeats sequences (SSRs) and TRF recognizes all tandem repeat elements in the whole genome. Transposable elements (TE) in the EC_S1 and EC_N1 genomes were then identified using a combination of ab initio and homology-based methods. For further identification of the repeats throughout the genome, RepeatMasker v2.0.1³² was applied to search for known and novel TEs by mapping sequences against the de novo repeat library and Repbase TE library (version 20180826)³³. Overlapping transposable elements belonging to the same repeat class were collated and combined. The repeat elements were annotated and shown in Table 8.

Table 8 Characterization of wild barley TE annotation in wild barley EC_S1 and EC_N1.

Full size table

Three independent approaches, including ab initio prediction, homology search, and reference guided transcriptome assembly, were used for gene prediction in a repeat-masked genome³⁴. In detail, GeMoMa v1.3.1³⁵ was used to align the homologous protein sequences from related species to the assembly and then got the gene structure information, which was homolog prediction. For RNA-seq based gene prediction, filtered mRNA-seq reads were aligned to the reference genome using STAR (default)³⁶. The transcripts were then assembled using Stringtie v2.1.4³⁷ and open reading frames (ORFs) were predicted using Program to Assemble Spliced Alignments (PASA)³⁸. For the de novo prediction, RNA-seq reads were de novo assembled using Stringtie and analyzed with PASA to produce a training set. Augustus v2.5.5³⁹ with default parameters was then utilized for ab initio gene prediction with the training set. Finally, EVidenceModeler (EVM)⁴⁰ was used to produce an integrated gene set of which genes with TE were removed using Transposon PSI package⁴¹ and the miscoded genes were further filtered. According to Mascher et al.⁷, high-confidence (HC) gene was defined as genes that had a significant BLAST hit to reference proteins and representative proteins had a similarity to the respective template sequence above a threshold which was determined on the basis of the origin of template sequences (>60% for Arabidopsis thaliana, sorghum and rice, >65% for Brachypodium distachyon, and >85% for barley). Finally, a total of 39,179 high-confidence and 20,936 low-confidence protein-coding genes were identified in EC_S1 genome, and 38,373 high-confidence and 20,243 low-confidence protein-coding genes in EC_N1 (Table 9).

Table 9 The summary of gene annotation in the EC_S1 and EC_N1 assemblies.

Full size table

Gene functional information, motifs and domains of their proteins were assigned by comparing with public databases including SwissProt⁴², NCBI non-redundant protein sequences (nr)⁴³, Kyoto Encyclopedia of Genes and Genomes (KEGG)⁴⁴, Clusters of orthologous groups for eukaryotic complete genomes (KOG)⁴⁵ and Gene Ontology (GO)⁴⁶. The putative domains and GO terms of genes were identified using the InterProScan program⁴⁷ with default parameters. For the other four databases, BLASTp⁴⁸ was used to compare the EvidenceModeler-integrated protein sequences against the four well-known public protein databases with an E-value cutoff of 1e-05 and the results with the hit with the lowest E value were retained. Results from the five database searches were concatenated, leading to a total of 56,261 (93.59%) genes in EC_S1 and 55,772 (95.15) genes in EC_N1 with function annotation (Table 9).

Annotation of non-coding RNA genes

To obtain the ncRNA (non-coding RNA), we used two strategies: searching against a database and predicting with a model. Transfer RNAs (tRNAs) were predicted using tRNAscan-SE v2.0.6⁴⁹ with eukaryote parameters. MicroRNA, rRNA, small nuclear RNA, and small nucleolar RNA were detected using Infernal cmscan⁵⁰ to search the Rfam database⁵¹. The rRNAs and their subunits were predicted using RNAmmer⁵². Finally, a total of 1,163 and 888 rRNA was identified in EC_S1 and EC_N1, respectively. Moreover, total of 7770 ncRNA was identified in EC_S1, including 1180 snRNA (0.0024%), 6188 miRNA (0.0158%), 229 spliceosomal (0.0007%) and 173 other (0.0005%); 7701 ncRNA was identified in EC_N1, including 1065 snRNA (0.0021%), 6246 miRNA (0.0156%), 225 spliceosomal (0.0007%) and 165 other (0.0005%). In addition, 1913 and 2039 tRNA were detected in EC_S1 and EC_N1, covering all 20 anti-codons types of amino acids (Table 10).

Table 10 Summary of non-coding RNA in the EC_S1 and EC_N1 assemblies.

Full size table

Data Records

The EC_S1 and EC_N1 genome sequence are available at NCBI database under Bioproject accession PRJNA947680^53,54. RNA-seq (The samples’ information are showed in Table S3), NGS, Hi-C, and Nanopore data sets are available at NCBI under Bioproject accession PRJNA748178⁵⁵. Bionano data sets are available at NCBI Supplementary Files under accession SUPPF_0000004010 (EC_S1) and SUPPF_0000004011 (EC_N1)⁵⁵. The genome annotation GFF3, CDS sequences, and protein sequences are available at figshare⁵⁶.

Technical Validation

DNA and RNA integrity

The quality of DNA and RNA molecules and libraries was examined before genome and transcriptome sequencing. The DNA degradation and contamination of the extracted DNA were monitored on 1% agarose gels. DNA purity was then inspected using NanoDrop™ 8000 spectrophotometer (Thermo Fisher Scientific, USA), of which OD260/280 ranged from 1.8 to 2.0 and OD 260/230 was between 2.0 to 2.2. Finally, DNA concentration was further measured by Qubit® 4.0 Fluorometer (Thermo Fisher Scientific, USA). The integrity of the RNA was determined with the Agilent 2100 Bioanalyzer (Agilent Technologies) and agarose gel electrophoresis. The purity and concentration of the RNA were determined with the Nanodrop^TM 8000 spectrophotometer (Thermo Fisher Scientific, USA) and Qubit® 4.0 Fluorometer (Thermo Fisher Scientific, USA). Only the high-quality RNA sample (OD260/280 = 1.8~2.2, OD260/230 ≥ 2.0, RIN ≥ 7, >1 μg) was used to construct the sequencing library.

Assessment of the genome assembly

After using BUSCO and CEGMA to evaluate genome integrity, we have also evaluated the accuracy of the genome. All the Illumina paired-end reads were mapped to the assembled genome using bwa 0.7.12-r1039 (default)²², and the mapping rate, as well as genome coverage of sequencing reads were assessed. Then samtools v1.4⁵⁷ and bcftools v2.29.2⁵⁸ were used to calculate the homozygous and heterozygous mutation sites corresponding to the samples. Homozygous sites were regarded as genomic error sites to calculate the single base error rate. The accuracy of genomic single base was 99.997% (depth > = 5x) in EC_S1 and 99.996% (depth > = 5x) in EC_N1. The Minimap2 r41 (-x map-ont)²³ was used to map all long-reads back to the genome, to calculate mapping rate, coverage, and GC content. the draft genome assemblies were submitted to the NT library (Nucleotide Sequence Database, downloaded on 3^rd August, 2018, https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz) and aligned sequences were eliminated to remove the mitochondria sequences in the assemblies. The results showed that most of the sequences were aligned with the target species, indicating that there was no external contamination in the assembled genome.

Finally, the seven chromosomes of EC_S1 and EC_N1 assemblies were evaluated. The genome with chromosomes aligned by Hi-C data was divided into ‘bin’ (in a length of 100 KB). The number of Hi-C read pairs covered by any two ‘bins’ was used to define the signal for the interaction between those ‘bins’²⁷, and the heat map of Hi-C interaction of chromosomes was made by HiCPlotter.py script in Python v2.7 (Fig. 3). This figure shows that the intensity of interaction in the diagonal position was higher than that in the non-diagonal position, and there was no obvious noise outside the diagonal, indicating that the chromosomes assembly of both EC_S1 and EC_N1 were high-quality.

Code availability

No specific code or script was used in this work. All commands used in the processing were executed according to the manual and protocols of the corresponding bioinformatics software.

References

Liu, M. et al. The draft genome of a wild barley genotype reveals its enrichment in genes related to biotic and abiotic stresses compared to cultivated barley. Plant Biotechnol. J. 18, 443–456 (2020).
Article CAS PubMed Google Scholar
Jonathan, B. & Blattner, F. R. Species-level phylogeny and polyploid relationships in Hordeum (Poaceae) inferred by next-generation sequencing and in silico cloning of multiple nuclear loci. Syst. Biol. 644, 792–808 (2015).
Google Scholar
Mayer, K. F. X. et al. Unlocking the barley genome by chromosomal and comparative genomics. Plant Cell. 23, 1249–1263 (2011).
Article CAS PubMed PubMed Central Google Scholar
Mingcheng, L. et al. Genome sequence of the progenitor of the wheat D genome Aegilops tauschii. Nature. 551, 498–502 (2017).
Article Google Scholar
Palmgren, M. G. et al. Are we ready for back-to-nature crop breeding? Trends Plant Sci. 20, 155–164 (2015).
Article CAS PubMed Google Scholar
Fa, Irbairn, A. The origins and spread of domesticated plants in Southwest Asia and Europe. Environ. Archaeol. 15, 99-100 (2010).
Mascher, M. et al. A chromosome conformation capture ordered sequence of the barley genome. Nature. 544, 426–433 (2017).
Article ADS Google Scholar
Zeng, X. Q. et al. The draft genome of Tibetan hulless barley reveals adaptive patterns to the high stressful Tibetan Plateau. P. Natl. Acad. Sci. USA 112, 1095–1100 (2015).
Article ADS CAS Google Scholar
Mayer, K. F. X. et al. A physical, genetic and functional sequence assembly of the barley genome. Nature. 491, 711–716 (2012).
Article ADS CAS PubMed Google Scholar
Mascher, M. et al. Long-read sequence assembly: a technical evaluation in barley. Plant Cell. 33, 1888–1906 (2021).
Article PubMed PubMed Central Google Scholar
Dai, F. et al. Assembly and analysis of a qingke reference genome demonstrate its close genetic relation to modern cultivated barley. Plant Biotechnol. J. 16, 760–770 (2018).
Article CAS PubMed Google Scholar
Jayakodi, M. et al. The barley pan-genome reveals the hidden legacy of mutation breeding. Nature. 588, 284–289 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Zhang, W. et al. Genome architecture and diverged selection shaping pattern of genomic differentiation in wild barley. Plant Biotechnol. J. (2022).
Belton, J. M. et al. Hi-C: A comprehensive technique to capture the conformation of genomes. Methods. 58, 268–276 (2012).
Article CAS PubMed Google Scholar
Chen, S. et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 34, 884–890 (2018).
Article Google Scholar
Marcais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 27, 764–770 (2011).
Article CAS PubMed PubMed Central Google Scholar
Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat. Commun. 11, 1432 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Li, Z. Y. et al. Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph. Brief Funct. Genomics. 11, 25–37 (2012).
Article PubMed Google Scholar
Myers, G. Building fragment assembly string graphs. Bioinformatics. 21, 79–85 (2005).
Article Google Scholar
Vaser, R. et al. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).
Article CAS PubMed PubMed Central Google Scholar
Simao, F. A. et al. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 31, 3210–3212 (2015).
Article CAS PubMed Google Scholar
Parra, G., Bradnam, K. & Korf, I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics. 23, 1061–1067 (2007).
Article CAS PubMed Google Scholar
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv. 1303, 1–3 (2013).
Google Scholar
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Article CAS PubMed PubMed Central Google Scholar
Servant, N. et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol. 16, 259–270 (2015).
Article PubMed PubMed Central Google Scholar
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 9, 357–354 (2012).
Article CAS PubMed PubMed Central Google Scholar
Burton, J. N. et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 31, 1119–1125 (2013).
Article CAS PubMed PubMed Central Google Scholar
Marçais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLoS Comput. Biol. 14, e1005944 (2018).
Article PubMed PubMed Central Google Scholar
He, W. et al. NGenomeSyn: an easy-to-use and flexible tool for publication-ready visualization of syntenic relationships across multiple genomes. Bioinformatics. 39, 121–122 (2023).
Article Google Scholar
Wang, X. W. & Wang, L. GMATA: An integrated software package for genome-scale SSR mining, marker development and viewing. Front. Plant Sci. 7, 1350 (2016).
PubMed PubMed Central Google Scholar
Gary, B. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
Article Google Scholar
Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics. 25, 1–14 (2009).
Article Google Scholar
Jurka, J. et al. Repbase Update, a database of eukaryotic repetitive elements. Mob DNA. 110, 462–467 (2005).
CAS Google Scholar
Zhou, Y. et al. Chromosome genome assembly and annotation of the yellowbelly pufferfish with PacBio and Hi-C sequencing data. Sci. Data 6, 267–275 (2019).
Article PubMed PubMed Central Google Scholar
Keilwagen, J. et al. GeMoMa: homology-based gene prediction utilizing intron position conservation and RNA-seq data. Methods Mol. Biol. 1962, 161–177 (2019).
Article CAS PubMed Google Scholar
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 29, 15–21 (2012).
Article PubMed PubMed Central Google Scholar
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).
Article CAS PubMed PubMed Central Google Scholar
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).
Article CAS PubMed PubMed Central Google Scholar
Mario, S. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 34, 435–439 (2006).
Article Google Scholar
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, R7 (2008).
Article PubMed PubMed Central Google Scholar
TransposonPSI. http://transposonpsi.sourceforge.net/.
Bairoch, A. The swiss-prot protein sequence database user manual. Nucleic Acids Res. 28, 45–48 (2000).
Article CAS PubMed PubMed Central Google Scholar
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics. 10, 421–430 (2009).
Article PubMed PubMed Central Google Scholar
Kanehisa, M. et al. BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences. J. Mol. Biol. 428, 726–731 (2016).
Article CAS PubMed Google Scholar
Tatusov, R. L. et al. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 29, 22–28 (2001).
Article CAS PubMed PubMed Central Google Scholar
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).
Article CAS PubMed PubMed Central Google Scholar
Zdobnov, E. M. & Rolf, A. InterProScan–an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17, 847–848 (2001).
Article CAS PubMed Google Scholar
Altschul, S. F. et al. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Article CAS PubMed Google Scholar
Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).
Article CAS PubMed PubMed Central Google Scholar
Nawrocki, E. P. et al. Infernal 1.0: inference of RNA alignments. Bioinformatics. 25, 1335–1337 (2009).
Article CAS PubMed PubMed Central Google Scholar
Griffiths-Jones, S. et al. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 33, 121–124 (2005).
Article Google Scholar
Karin, L. et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 35, 3100–3108 (2007).
Article Google Scholar
NCBI Assembly https://identifiers.org/ncbi/insdc.gca:GCA_029782615.1 (2023).
NCBI Assembly https://identifiers.org/ncbi/insdc.gca:GCA_029783385.1 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP329205 (2023).
Pan, R. Wild barley genome annotation. Figshare https://doi.org/10.6084/m9.figshare.23501529.v1 (2023).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 25, 2078–2079 (2009).
Article PubMed PubMed Central Google Scholar
Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 27, 2987–2993 (2011).
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

This work was supported by Hubei Key Research and Development Program (2021BBA225) and the National Natural Science Foundation of China (31901438, 31471496), Grain Research and Development Corporation (9176507), and the support from Scientific and Technological Innovation Team Foundation of Yangtze University. The authors would also like to thank the Pawsey Supercomputing Centre for the use of their computing resources.

Author information

Authors and Affiliations

Research Center of Crop Stresses Resistance Technologies, Yangtze University, Jingzhou, 434025, China
Rui Pan, Le Xu, Yanhao Xu & Wenying Zhang
Western Crop Genetics Alliance, Western Australian State Agricultural Biotechnology Centre, College of Science, Health, Engineering and Education, Murdoch University, Murdoch, WA, 6155, Australia
Haifei Hu, Chengdao Li & Tianhua He
Rice Research Institute, Guangdong Academy of Agricultural Sciences & Key Laboratory of Genetics and Breeding of High-Quality Rice in Southern China (Co-construction by Ministry and Province), Ministry of Agriculture and Rural Affairs & Guangdong Key Laboratory of New Technology in Rice Breeding & Guangdong Rice Engineering Laboratory, Guangzhou, 510640, China
Haifei Hu
Grandomics Biotechnology Co., Ltd, Wuhan, 430076, China
Yuhui Xiao & Kai Ouyang
Hubei Collaborative Innovation Centre for Grain Industry, Yangtze University, Jingzhou, 434025, China
Le Xu & Yanhao Xu
Department of Primary Industries and Regional Development, South Perth, WA, 6155, Australia
Chengdao Li
MARA Key Laboratory of Sustainable Crop Production in the Middle Reaches of the Yangtze River (Co-construction by Ministry and Province), Yangtze University, Jingzhou, 434025, China
Wenying Zhang

Authors

Rui Pan
View author publications
You can also search for this author in PubMed Google Scholar
Haifei Hu
View author publications
You can also search for this author in PubMed Google Scholar
Yuhui Xiao
View author publications
You can also search for this author in PubMed Google Scholar
Le Xu
View author publications
You can also search for this author in PubMed Google Scholar
Yanhao Xu
View author publications
You can also search for this author in PubMed Google Scholar
Kai Ouyang
View author publications
You can also search for this author in PubMed Google Scholar
Chengdao Li
View author publications
You can also search for this author in PubMed Google Scholar
Tianhua He
View author publications
You can also search for this author in PubMed Google Scholar
Wenying Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Wenying Zhang and Tianhua He designed and conceived this work; Rui Pan and Haifei Hu collected the materials and prepared DNA and RNA for sequencing; Rui Pan and Haifei Hu analyzed the data. Rui Pan wrote the manuscript with other authors’ help; Yuhui Xiao, Le Xu, Yanhao Xu, Kai Ouyang and Chengdao Li revised the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Tianhua He or Wenying Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Figure S1

Figure S2

Figure S3

Figure S4

Supplementary Tables S3

Supplementary Tables S1

Supplementary Tables S2

Supplementary Figure

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Pan, R., Hu, H., Xiao, Y. et al. High-quality wild barley genome assemblies and annotation with Nanopore long reads and Hi-C sequencing data. Sci Data 10, 535 (2023). https://doi.org/10.1038/s41597-023-02434-2

Download citation

Received: 21 April 2023
Accepted: 31 July 2023
Published: 10 August 2023
DOI: https://doi.org/10.1038/s41597-023-02434-2

Subjects

Abstract

Similar content being viewed by others

Background & Summary

Methods

Sample preparation, library construction and sequencing

De novo assembly of the wild barley genome

Chromosome assembly by optical mapping and Hi-C data

Gene model prediction and functional annotation

Annotation of non-coding RNA genes

Data Records

Technical Validation

DNA and RNA integrity

Assessment of the genome assembly

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Additional information

Supplementary information

Rights and permissions

About this article

Cite this article

Share this article

Search

Quick links