Chromosome-level genome assembly of the critically endangered Baer’s pochard (Aythya baeri)

Baer’s pochard (Aythya baeri) is a critically endangered species that was historically widespread throughout East Asia. According to a recent estimate, its population has declined to between 150 and 700 individuals, and the species faces a long-term risk of extinction. However, the lack of a reference genome limits studies of the conservation management and molecular biology of this species. We therefore report the first high-quality genome assembly of Baer’s pochard. The genome has a total length of 1.14 Gb with a scaffold N50 of 85,749,954 bp and a contig N50 of 29,098,202 bp. We anchored 97.88% of the scaffold sequences onto 35 chromosomes based on the Hi-C data. BUSCO assessment indicated that 97.00% of the highly conserved Aves genes were completely present in the genome assembly. Furthermore, a total of 157.06 Mb of repetitive sequences were identified and 18,581 protein-coding genes were predicted in the genome, of which 99.00% were functionally annotated. This genome will be useful for understanding the genetic diversity of Baer’s pochard and will facilitate conservation planning for this species.


Background & Summary
Baer's pochard is a migratory duck belonging to the order Anseriformes, family Anatidae, and genus Aythya, whose closest relative and sister species is the ferruginous duck 1. Baer's pochard shows typical sexual dimorphism. Males have white or light-yellow irises (Fig. 1), whereas females have dark brown irises. Females also have reddish brown spots at the base of the beak 2,3 and are smaller in size. This species was once widespread in East and South Asia, but is now found predominantly in China 4,5 because over-exploitation and habitat loss have caused a severe global population decline over the past decades 6,7. Baer's pochard was classified as Endangered by the International Union for Conservation of Nature (IUCN) in 2008, then as Critically Endangered in 2012, and in 2021 was included in the China Red Data Book of Endangered Animals. According to a recent estimate by the IUCN, the population comprises only 150-700 mature individuals 8, and the species faces a long-term risk of extinction. Moreover, although the number of avian genome assemblies has increased in recent years 9, many non-model species, including Baer's pochard, still lack genome resources.
To provide genome-scale insights into a species close to extinction and to promote conservation planning for it, we constructed the first high-quality chromosome-level reference genome of Baer's pochard using Illumina paired-end sequencing, Oxford Nanopore sequencing, and Hi-C technology. The genome had an assembly size of 1.14 Gb with a scaffold N50 of 85,749,954 bp and a contig N50 of 29,098,202 bp. The scaffolds were further clustered and ordered into 35 pseudo-chromosomes based on the Hi-C data, representing 97.88% of the assembled sequences. The genome contained 13.72% repeat sequences and 1,721 noncoding RNAs. A total of 18,581 protein-coding genes were predicted in the genome, of which 99.00% were functionally annotated. Searches for complete Aves BUSCO (Benchmarking Universal Single-Copy Ortholog) gene groups showed that 97.00% of BUSCO genes were complete, suggesting a high level of genome completeness. This genome provides a valuable resource for the conservation genomics of critically endangered species and may help efforts to recover their populations.

Methods
Sample and sequencing. Baer's pochard tissue for whole-genome sequencing was obtained from a dead individual that had strayed into a fishing net in Shandong (China). The muscle tissue that we collected was stored at −80 °C and used for genomic DNA extraction and sequencing. Nine additional transcriptomic samples (heart, kidney, lung, spleen, liver, craw, gallbladder, blood, and muscle) were collected from the same individual and stored at −80 °C until RNA was extracted for transcriptome sequencing. Paired-end libraries of genomic DNA (gDNA) were prepared using Illumina TruSeq Nano DNA Library Prep kits. The integrity and quality of the extracted DNA were checked using agarose gel electrophoresis and a Qubit Fluorometer.
One library with an insert size of 350 bp was constructed and sequenced using the Illumina HiSeq platform to enable the genome survey and base-level correction. A total of 60.34 Gb (coverage of 49.69×) of 150-bp paired-end reads were generated. Purified DNA was then prepared for sequencing with the genomic sequencing kit SQK-LSK109 (Oxford Nanopore Technologies, Oxford, UK) following the provided protocol, and single-molecule long-read sequencing was conducted using the PromethION platform (ONT, Oxford, UK). Approximately 136.50 Gb of data was obtained (coverage of 112.42×). The Hi-C library was constructed using muscle tissue from the same Baer's pochard individual and sequenced using the Illumina PE150 platform. A total of 125.64 Gb of 150-bp paired-end reads were obtained, which covered ~103.48× of the genome (Table 1). Finally, RNA was extracted from the nine transcriptomic samples and used for library construction, and RNA-Seq reads were generated for genome annotation using the Illumina NovaSeq 6000 platform. A total of 67.93 Gb of 150-bp paired-end reads were obtained after adapter trimming and quality filtering (Table 2).
Genome assembly. We used a combination of Nanopore long reads, Illumina short reads, and chromatin conformation capture (Hi-C) to generate a chromosome-level reference genome. The genome size and heterozygosity level of Baer's pochard were determined from Illumina short reads based on the k-mer spectrum 10. The genome size was estimated to be approximately 1,214.25 Mb, and the heterozygosity rate of the genome was 0.38% (Table 3). NextDenovo (https://github.com/Nextomics) was used to build the initial assembly from the Nanopore long reads. However, long reads have low quality scores, and thus NextPolish 11, which uses quality-controlled Illumina short reads, was employed to improve the assembled genome.
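The coverage values quoted for each library can be reproduced with a simple division of sequencing yield by genome size. The sketch below assumes that the denominator is the k-mer-based genome size estimate (1,214.25 Mb); the text does not state this explicitly, but the reported depths agree with it to within rounding. The function and variable names are illustrative only.

```python
# Sequencing depth = total sequenced bases / genome size.
# Assumption: the k-mer-based estimate (1,214.25 Mb) is used as the genome size.
GENOME_SIZE_MB = 1214.25

# Yields in Gb as reported in the text.
YIELDS_GB = {"Illumina": 60.34, "Nanopore": 136.50, "Hi-C": 125.64}


def coverage(yield_gb, genome_size_mb=GENOME_SIZE_MB):
    """Return the mean sequencing depth (fold coverage)."""
    return yield_gb * 1000.0 / genome_size_mb


for library, gb in YIELDS_GB.items():
    print(f"{library}: {coverage(gb):.2f}x")
```

Small differences in the last decimal place relative to the reported values are expected, because the yields are themselves rounded to two decimals.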
These steps yielded the final Baer's pochard genome assembly with a total length of 1.14 Gb, which was broadly consistent with the k-mer-based estimate. The assembly comprised 228 contigs with a contig N50 of 29,098,202 bp, and the overall GC content of the genome was 41.94% (Table 4).
We obtained 125.64 Gb of Hi-C sequencing data to generate the chromosome-level assembly. We first used HiCUP 12 to map and process the reads obtained from the Hi-C library; the Hi-C-corrected contigs were then subjected to the ALLHiC pipeline 13 for partitioning, orientation, and ordering. A total of 135 scaffolds were anchored to 35 chromosomes with lengths ranging from 1.77 Mb to 208.01 Mb, which covered 97.88% of the whole genome. Finally, we obtained the first high-quality chromosome-level Baer's pochard assembly (1.14 Gb) with a scaffold N50 length of 85.75 Mb (Table 5 and Fig. 2).
Table 9. Annotation of repeated sequences.
The genome size, scaffold N50 length, and GC content of Aythya baeri are similar to those of Aythya fuligula (RefSeq assembly accession: GCF_009819795.1), a member of the same genus, but its contig N50 length is much longer than that of Aythya fuligula (Table 6). This indicates that the Aythya baeri genome assembly is of high quality.
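For readers unfamiliar with the contig and scaffold N50 statistics used above, the following minimal sketch shows the standard definition: the length L such that sequences of length L or longer together contain at least half of the assembled bases. The toy lengths are hypothetical and are not taken from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated until
    the running total reaches half of the total assembly length; the
    length of the sequence that crosses this threshold is the N50.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (arbitrary units), for illustration only.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

Applied to the contig or scaffold lengths of a FASTA file, the same function would yield the kind of values reported in Tables 4 and 5.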
We used the Core Eukaryotic Genes Mapping Approach (CEGMA v2.5) 14 and Benchmarking Universal Single-Copy Orthologs (BUSCO v4.1.2) 15 to evaluate the completeness of the genome assembly. A single-copy ortholog set was searched against the assembled Baer's pochard genome using BUSCO; of the 8,338 single-copy orthologs in the avian lineage (aves_odb10), approximately 97.00% were complete in this assembly (Table 7). CEGMA uses a core gene library of 248 genes conserved across six eukaryotic model organisms, and the CEGMA evaluation showed that 95.97% of these genes were successfully assembled (Table 8).
Annotation of genomic repeat sequences. We annotated the Baer's pochard whole-genome repeat sequences based on homology alignment and de novo predictions. RepeatModeler (v1.0.8) 16, RepeatScout (v1.0.5) 17 and LTR_FINDER (v1.0.7) 18 were used to build a de novo repetitive element database. Tandem repeats were extracted using TRF 19 via ab initio prediction. Homology-based prediction was performed against the Repbase database 20 using RepeatMasker (v4.0.5) 21 to extract repeat regions (Table 9). According to these analyses, approximately 157.06 Mb of repeat sequences were identified, which accounted for 13.72% of the whole genome; thus, the repeat content of the A. baeri genome is slightly higher than that of the A. fuligula genome (13.00%). Among the repeat elements, long interspersed nuclear elements (LINEs) account for 8.80% of the genome, short interspersed nuclear elements (SINEs) for 0.01%, long terminal repeats (LTRs) for 4.13% and DNA transposons for 0.15% (Table 10).
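The completeness and repeat percentages above can be cross-checked with a few lines of arithmetic. This is only a consistency sketch: the CEGMA gene count (238 of 248) is inferred here from the reported 95.97% and is not stated in the text, and the 157.06 Mb repeat total is the value reported in the abstract.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Repeat content: 157.06 Mb reported as 13.72% of the genome.
repeat_mb = 157.06
repeat_pct = 13.72
implied_genome_mb = repeat_mb / (repeat_pct / 100.0)
print(f"Genome size implied by repeat content: {implied_genome_mb:.1f} Mb")

# Sum of the four repeat classes listed in the text (percent of genome).
class_pct = {"LINE": 8.80, "SINE": 0.01, "LTR": 4.13, "DNA": 0.15}
other_pct = repeat_pct - sum(class_pct.values())
print(f"Repeats outside these four classes: {other_pct:.2f}% of the genome")

# CEGMA: inferred number of assembled core genes (an assumption, see above).
cegma_total = 248
cegma_found = round(cegma_total * 95.97 / 100.0)
print(f"CEGMA: {cegma_found}/{cegma_total} = {percent(cegma_found, cegma_total):.2f}%")
```

The implied genome size of roughly 1.14 Gb matches the assembly length, which supports reading the repeat total as about 157 Mb.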
We also predicted 432 tRNAs using tRNAscan-SE 32. We identified 664 ncRNAs, including 342 miRNAs and 322 snRNAs, by searching against the Rfam database with default parameters using Infernal 33. Because rRNAs are highly conserved, we used the rRNA sequences of related species as references and predicted 161 rRNA sequences using BLAST 34 (Table 12).
Synteny analysis using the tufted duck genome. We conducted a whole-genome synteny analysis between the tufted duck (GCA_009819795.1) and Baer's pochard genomes using MUMmer 40. The whole-genome alignment between the two genomes was visualized using RectChr (BGI-shenzhen/RectChr), as shown in Fig. 4. The results showed high overall consistency between the tufted duck and Baer's pochard genomes.

Data Records
The Nanopore, Illumina, and Hi-C sequencing data used for genome assembly were deposited in the NCBI Sequence Read Archive database with accession numbers SRR17568785 41, SRR17518553 42, and SRR17509905 43. The transcriptomic sequencing data were stored under accession numbers SRR17433182 44 and SRR17497023 45 - SRR17497030. The assembled genome was deposited in the NCBI assembly with the accession number JAKRSJ000000000 46. The annotation results of repeated sequences, gene structure and functional prediction were deposited in the Figshare database 47.

Technical Validation
The integrity of the extracted DNA was checked by agarose gel electrophoresis, and the main band was found to be approximately 45 kb long. The concentration of DNA was determined using a Qubit fluorometer (Thermo Fisher Scientific, USA), and the 260/280 absorbance ratio was approximately 1.80. To evaluate the completeness and consistency of the genome assembly, we aligned the small-fragment library reads to the assembled genome using BWA (http://bio-bwa.sourceforge.net/). The alignment rate of all small-fragment reads to the genome was approximately 99.71%, and the coverage rate was approximately 99.45%, indicating consistency between the reads and the assembled genome.
SNPs were identified using Samtools (v0.1.19), which yielded 3,162,696 SNPs, including 3,157,033 heterozygous SNPs and 5,663 homozygous SNPs. Homozygous SNPs accounted for only 0.000502% of the genome, indicating the high accuracy of this assembly.
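The SNP counts above can be checked with simple arithmetic. In the sketch below, the denominator for the homozygous SNP percentage is assumed to be the assembly length; the text does not name it, and the rounded value of 1.14 Gb is used here, so the result agrees with the reported 0.000502% only to within rounding. Because the reads come from the same individual as the assembly, homozygous differences are taken as likely base-level errors, which is why a low rate suggests an accurate assembly.

```python
# SNP counts as reported in the text.
TOTAL_SNPS = 3_162_696
HET_SNPS = 3_157_033
HOM_SNPS = 5_663

# Assumption: percentages are relative to the assembly length (about 1.14 Gb).
ASSEMBLY_BP = 1.14e9


def per_genome_percent(count, genome_bp=ASSEMBLY_BP):
    """Return count as a percentage of genome positions."""
    return 100.0 * count / genome_bp


counts_consistent = HET_SNPS + HOM_SNPS == TOTAL_SNPS
hom_pct = per_genome_percent(HOM_SNPS)
het_pct = per_genome_percent(HET_SNPS)

print(f"Counts consistent: {counts_consistent}")
print(f"Homozygous SNPs: {hom_pct:.6f}% of the genome")
print(f"Heterozygous SNPs: {het_pct:.3f}% of the genome")
```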

Code availability
All commands and pipelines used in data processing were executed according to the manual and protocols of the corresponding bioinformatic software. No specific code has been developed for this study.