A near complete genome assembly of the East Friesian sheep genome

Advancements in sequencing have enabled the assembly of numerous sheep genomes, significantly advancing our understanding of the link between genetic variation and phenotypic traits. However, the genome of East Friesian sheep (Ostfriesisches Milchschaf), a key high-yield milk breed, remains to be fully assembled. Here, we constructed a near-complete and gap-free East Friesian genome assembly using PacBio HiFi, ultra-long ONT and Hi-C sequencing. The resulting genome assembly spans approximately 2.96 Gb, with a contig N50 length of 104.1 Mb and only 164 unplaced sequences. Remarkably, our assembly has captured 41 telomeres and 24 centromeres. The assembled sequence is of high quality on completeness (BUSCO score: 97.1%) and correctness (QV: 69.1). In addition, a total of 24,580 protein-coding genes were predicted, of which 97.2% (23,891) carried at least one conserved functional domain. Collectively, this assembly provides not only a near T2T gap-free genome, but also provides a valuable genetic resource for comparative genome studies of sheep and will serve as an important tool for the sheep research community.


Background & Summary
Selective breeding for different agricultural purposes, such as meat, wool, and milk, have established many sheep breeds with unique characteristics worldwide 1 .The East Friesian sheep (Ostfriesisches Milchschaf) is a highly specialized breed.The breed originates from the Frisia region of both the Netherlands and Germany, and is considered to be the world's highest producing dairy sheep 2,3 .In a single lactation, the East Friesian sheep can produce 500-700 kg of milk over a period of approximately 230 days 4 .Additionally, East Friesian sheep have a relatively high average number of lambs per ewe, 2.25 lambs/litter, but the carcass of lambs is very lean 5 .In physical appearance, East Friesian sheep have many unique features.They have a relatively large body, head, face, legs, ears all clean of wool.Their most distinctive physical feature is a "rat-tail" which is thin and devoid of wool.The East Friesian sheep, renowned for its adaptability, has been successfully crossbred with breeds known for their robust ketone body composition, such as Suffolk, Dorset, and Texel.This strategic crossbreeding not only enhances the meat quality of the East Friesian sheep but also ameliorates the traits of breeds that exhibit lower milk yields and suboptimal reproductive and lambing capabilities.Hailing from the northern regions of Germany and the Friesland area in the Netherlands, the East Friesian breed has garnered international attention and has been integrated into the livestock industries of various countries, including China, the United Kingdom, and South Africa.The exploration of the breed's genetic makeup at the molecular level presents a compelling opportunity to deepen our comprehension of the genetic underpinnings of economically significant traits in sheep, thereby contributing to the advancement of the field.
De novo genome assembly is a fundamental and powerful tool employed in the realm of molecular research.Several genomes of sheep genomes have been made publicly available in databases, including East Friesian sheep 6 , Tibetan sheep 7 , Rambouillet sheep 8 , and Texel sheep 9 .Despite the achievement of chromosome-level assembly in these sheep genomes, there still exist unidentified regions containing gaps that require further investigation and determination.A number of assemblers have been developed for long reads assembly, such as Falcon 10 , Flye 11 , Canu 12 , wtdbg2 13 , NextDenovo 14 and Hifiasm 15 .The Hifiasm method stands out for its www.nature.com/scientificdatawww.nature.com/scientificdata/utilization of string-overlap graphs to represent genomes, encode information for algorithmic analysis, and visually present both primary and alternative paths along a DNA sequence 16 .New developments in long-read sequencing technologies, such as Pacific Biosciences (PacBio) circular consensus (CCS) long-read sequencing and ultra-long ONT sequencing, has revolutionized our ability to acquire comprehensive chromosome sequences spanning from one telomere to another.With the availability of a complete genome sequence, researchers would have the opportunity to thoroughly investigate and gain a deeper understanding of genome function, regulation, and evolution 17,18 .
In this study, we present the first near T2T gap-free genome assembly for East Friesian sheep using a combination of PacBio high-fidelity (HiFi) long-read, Oxford Nanopore (ONT) ultra long-read, and high-throughput chromosome conformation capture (Hi-C) sequencing data.In total, we generated 321 Gb (~107X coverage) ONT reads with a N50 of 63.5 kb, 148 Gb PacBio HiFi CCS reads with a N50 of 22.1 kb (~49X coverage), and 396 Gb Hi-C data (MGISEQ paired-end reads, ~132X coverage) (Table 1).The final genome assembly of East Friesian sheep, termed as EFS v2.0, is about 2.96 Gb with a scaffold N50 of 104.10 Mb, comprising 27 chromosomes without any gaps (Table 2; Fig. 1) and 164 unplaced sequences.We observed that 94.53% of these unplaced sequences consist of repetitive elements, among which satellite sequences constitute 84.64%.Further research and refinement are needed to determine their precise genomic location and functional relevance.The EFS v2.0 assembly captured 41 telomeres and 24 centromeres (Table 3).Notably, the EFS v2.0 assembly closed 35 gaps in total compared to the previously published East Friesian genome 6 (Fig. 2).
In the EFS v2.0 genome, repeat sequences accounted for 1.60 Gb, representing 53.98% of the assembly (Table 4).Long interspersed nuclear elements (LINE) retrotransposons (41.46%) were the most abundant component among repetitive elements, which was consistent with a previous study 19 (Table 5).Gene annotation identified 24,580 protein-coding genes.Of which, 24,536 genes (99.8%) were anchored to 27 chromosomes (Fig. 1), while 44 genes anchored to unplaced scaffolds.The length and number of exons were similar to those of three other sheep breeds (Fig. 3a,b).Furthermore, the predicted proteins achieved a complete BUSCO score of approximately 98%, indicating high quality annotation (Fig. 3c).23,891 (97.2%) protein-coding genes were successfully annotated in diverse databases, including Gene Ontology (GO), KOG, Interpro, SwissProt 20 , Kyoto Encyclopedia of Genes and Genomes (KEGG) 21 , NCBI nonredundant database (NR), and Translation of European Molecular Biology Laboratory (Trembl) (Table 6).Moreover, 17,328 (~70.5%)genes were supported by all five databases (Fig. 3d).Based on transcriptomic deep-sequencing data, we investigated gene expression level in five different tissues (Table 7).A total of 15,263 (62.2%) genes showed detectable expression levels (transcripts per million ≥ 1) in one or more of these tissues.Through structural variants analysis with the previously published East Friesian sheep 6 , we identified 232 newly assembled genes, among which 151 were expressed in 5 different transcriptome samples (Table 8; Fig. 4).

Methods
Sample collection, DNA preparation and RNA extractions.A 1-year-old female East Friesian sheep from Inner Mongolia key Lab of Bio-manufacture in Inner Mongolia autonomous region of China was chosen for DNA and RNA sequencing.The assembled sequence does not include the Y chromosome due to sampling from females.The animal was healthy, and no genetic defects were observed in it or its parents.www.nature.com/scientificdatawww.nature.com/scientificdata/DNA was extracted from fresh blood specimen using the QIAGEN Blood & Cell Culture DNA Midi Kit according to the manufacturer's instruction (QIAGEN, Germany).TRIzol (Invitrogen, Carlsbad, CA, United States) was used to extract total RNA from heart, rumen, subcutaneous fat, lung and perirenal fat tissues.The concentration of total RNA was determined using the Nano 6000 spectrophotometer Assay Kit of the Bioanalyzer 2100 system (Agilent Technologies, Santa Clara, CA, United States).The RNA purity was determined using the Qubit ® RNA Assay Kit in a Qubit ® 2.0 Fluorometer (Life Technologies, Camarillo, CA, United States).
Long insert libraries preparation and sequencing.The library construction and sequencing of RNA-seq full-length transcripts were conducted using a method similar to that described in Yuan, Ge et al. 22 , resulting in 437,807 full-length non-chimeric reads with mean length of 1,388 bp.
For the DNA PacBio long inserts libraries, the preparation was carried out in accordance with the "Using SMRTbell Express Template Prep Kit 2.0 With Low DNA Input" protocol 23 provided by PacBio (Pacific Biosciences, USA).This resulted in libraries with an insert size of approximately 20 kb.Subsequently, the libraries were subjected to sequencing using PacBio Sequel II platforms operating in CCS mode.The subreads were processed through the CCS algorithm of SMRTLink (v8.0.0) 24 with specific parameters: "-minPasses 3 -minPredictedAccuracy 0.99 -minLength 500", yielding 148 Gb of PacBio's long high-fidelity (HiFi) reads in total.
Furthermore, ultra-long DNA ONT libraries were created following the protocols detailed by Shafin et al. 25 .These libraries were then sequenced on the PromethION sequencer platform (Oxford Nanopore Technologies, UK).The sequencing effort resulted in the production of 8,180,779 reads, with an N50 value of 63,509 bp.
The Hi-C library was prepared using the same method described in Yin, Chen et al. 26 with the same blood specimen and sequenced on a MGISEQ-2000 instrument.A total of 395 Gb of clean data were obtained from 396 Gb of sequencing data using software SOAPnuke (v2.0) 27 with parameters "-n 0.01 -l 20 -q 0.1 -i -Q 2 -G 2 -M 2 -A 0.5".

Genome assembly.
With the HiFi reads, the primary contigs were assembled using Hifiasm (v 0.16.1) 15 with default parameters.The Hi-C valid reads were employed to anchor contigs onto chromosomes through Juicer 28 and 3d-dna pipeline 29 .The chromosome nomenclature was adopted for the chromosome numbering on the basis of their collinearity with 27 chromosomes of Texel sheep genome 30 .To achieve a near T2T gap-free reference genome assembly, gaps in the assembly genome were filled using LR_Gapcloser 31 with error-corrected ONT long reads produced by NECAT 32 .
Identification of new assembled genes.The software Syri (v1.6.3) 56was employed to detect structural variations between the EFS v2.0 genome assembly and the previously published East Friesian sheep 6 .A gene was classified as newly assembled if the previously published East Friesian sheep 6 exhibited a deletion of at least 50 bp and the gene region had a minimum overlap of 30% with that region.
Reads coverage analysis of genome assembly.We assessed whether the long sequencing reads extended across the regions that required gap filling.Prior to this process, the genome contained eight gaps.We employed minimap2 57 (v 2.24) to map both the ONT and HiFi reads to the EFS v2.0 genome.Utilizing SAMtools 58 (v 1.10) with the '-q 20' option, we filtered out low-quality and multi-mapping reads.Subsequently, we utilized the IGV software for visualizing the high-quality alignment results.www.nature.com/scientificdatawww.nature.com/scientificdata/Quality value (QV) calculations.In the realm of whole-genome sequencing, the Quality Value (QV) emerges as an essential metric for gauging the precision of nucleotide identification.The QV is derived from the Phred quality score, a measure that captures the negative logarithm of the likelihood that a given base call is erroneous.The QV is precisely calculated through the equation QV = −10 × log 10 (error probability).For instance, an error probability of 0.001 equates to a QV of 30, indicating a high confidence in the correctness of the base call.Throughout the sequencing process, each nucleotide is appraised with a Phred score that is contingent upon the signal-to-noise ratio; this score is subsequently converted to a QV, thereby providing an index of the sequencing data's fidelity.In this study, we have employed the Merqury 59 software to meticulously compute the QV, ensuring robust data quality assessment.Fig. 7 The identification of syntenic regions for EFS v2.0, Rambouillet sheep and Tibetan sheep was based on conducting homology searches using MCScan (Python version) 63 , with a minimum requirement of 30 genes per block.Macrosynteny connecting blocks of >30 one-to-one gene pairs are shown. www.nature.com/scientificdatawww.nature.com/scientificdata/

Data Records
The DNA sequence reads of East Friesian sheep (Experiment of DNA sequencing data from ultra-long ONT library: SRR26273756 60 ; Experiments of DNA sequencing data from Hi-C library: SRR26273763 60 ; Experiments of DNA sequencing data from PacBio HiFi library: SRR26273762 60 ) and RNA sequence reads of East Friesian sheep (Experiment of 5 transcriptome libraries: SRR26273757-SRR26273761 60 ) have been deposited in the Sequence Read Archive (SRA).The genome assembly have been deposited in the GenBank database under the accession number JAWMPZ000000000 61 .The files of the gene structure annotation, repeat predictions and gene functional annotation have been deposited at Figshare database 62 .

Technical Validation
Multiple methods were employed to validate the accuracy and completeness of EFS v2.0 assembly.
Firstly, we utilized long sequencing reads to ascertain their extension across the eight gap regions (Table 9).The resulting plots confirmed comprehensive coverage of the targeted regions (Fig. 5).Secondly, the Hi-C heatmap displayed high consistency across all chromosomes, demonstrating the correct ordering and orientation of contigs in the EFS v2.0 assembly (Fig. 6).Thirdly, the EFS v2.0 assembly exhibited high collinearity with Rambouillet sheep (GCA_016772045.1) 8, Tibetan sheep (GCA_017524585.1) 7and the previously published East Friesian sheep (GCA_018804185.1) 6(Fig. 7).Fourthly, the accuracy was confirmed by the high mapping rates of two type sequences on the EFS v2.0 assembly, with 99.93% of ONT reads and 100% of HiFi reads aligning to the EFS v2.0 assembly.Notably, the sequencing assembly attained a remarkable quality value (QV) score of 69.1, signifying an exceptionally low error rate of approximately 1.26 errors per 100 million bases.This level of sequencing accuracy and reliability is highly commendable and will undoubtedly facilitate subsequent genetic analysis and research.Lastly, the Benchmarking Universal Single-Copy Orthologs (BUSCO) test revealed that EFS v2.0 assembly successfully identified 97.1% of 9,226 mammalia gene sets, which exhibiting the highest level of BUSCO completeness among the four commonly used genomes (Fig. 8).

Fig. 1
Fig. 1 Circos plot of the EFS v2.0 genome.From inside to outside, I: GC content in nonoverlapping 1 Mb windows (histograms); II: percent coverage of repetitive sequences in nonoverlapping 1 Mb windows (heat maps); III: gene density calculated based on the number of genes in nonoverlapping 1 Mb windows (heat maps); IV: 27 super-scaffolds.Lengths are shown in Mb.

Fig. 3
Fig. 3 Quality assessment of the protein-coding genes in the EFS v2.0 assembly.(a) Comparison of exon length among four sheep gene sets.Window refers to the length of every point.(b) Comparison of exon number among four sheep gene sets.No obvious unexpected differences exist among these four organisms, indicating the high quality of gene structure annotation.(c) BUSCO assessment results of protein-coding genes in the EFS v2.0 assembly.(d) Gene function annotation results in a statistics Venn diagram using five public databases: NR, InterPro, KEGG, SwissProt and KOG.

Fig. 5
Fig. 5 Using IGV to demonstrate the coverage of ONT and PacBio reads in the gap 1 region.The IGV images for Gap 1 through Gap 8 are available through the Figshare database 62 .

Fig. 6
Fig.6 The accuracy and completeness of the EFS v2.0 genome assembly.Whole-genome Hi-C heatmap of EFS v2.0 within and between 27 chromosomes.

Table 1 .
Summary of sequencing data of East Friesian sheep genome.

Table 2 .
Comparison of four sheep genomes.

Table 3 .
Centromere positions of East Friesian sheep genome.
Fig. 2 Overview of the near T2T and gap-free EFS v2.0 reference genome.The box represents the 35 closed gaps identified from GCA_018804185.1.The triangle represents the telomere region, and the circle represents the centromere region.

Table 4 .
General statistics of repeats in the EFS v2.0 assembly.Note: Some elements may partially overlap with another element domain.

Table 5 .
Transposable elements (TEs) in the assembled EFS v2.0 assembly.Note: This statistical table does not contain Tandem Repeats, some elements may partly include another element domain.*Combined: the nonredundant consensus of all repeat prediction/classification methods employed.† Unknown: the predicted repeats that cannot be classified by RepeatMasker; LINE, long interspersed nuclear elements; SINE, short interspersed nuclear elements; LTR, long terminal repeat.

Table 6 .
Number of functional annotations for predicted genes in the EFS v2.0 assembly.

Table 7 .
Summary of RNA-seq sequencing data of East Friesian sheep genome.

Table 9 .
The location of the gap to be filled.