Background & Summary

Baer’s pochard (Aythya baeri) is a migratory duck belonging to the order Anseriformes, family Anatidae, and genus Aythya, whose closest relative and sister species is the ferruginous duck1. The species shows typical sexual dimorphism: males have white or light-yellow irises (Fig. 1), whereas females have dark brown irises, reddish brown spots at the base of the beak2,3, and a smaller body size. This species was once widespread in East and South Asia but is now found predominantly in China4,5, because over-exploitation and habitat loss have caused a severe global population decline over the past decades6,7. Baer’s pochard was classified as Endangered by the International Union for Conservation of Nature (IUCN) in 2008, then as Critically Endangered in 2012, and in 2021 was included in the China Red Data Book of Endangered Animals. According to a recent IUCN estimate, the population comprises only 150–700 mature individuals8 and faces a long-term risk of extinction. Moreover, although the number of avian genome assemblies has increased in recent years9, many non-model species, including Baer’s pochard, still lack genome resources.

Fig. 1
An adult male Baer’s pochard (Qiang Li).

To provide genome-scale insights into a species on the brink of extinction and to support its conservation planning, we constructed the first high-quality chromosome-level reference genome of Baer’s pochard using Illumina paired-end sequencing, Oxford Nanopore sequencing, and Hi-C technology. The assembly had a size of 1.14 Gb, with a scaffold N50 of 85,749,954 bp and a contig N50 of 29,098,202 bp. The scaffolds were further clustered and ordered into 35 pseudo-chromosomes based on the Hi-C data, representing 97.88% of the assembled sequences. The genome contained 13.72% repeat sequences and 1,721 noncoding RNAs. A total of 18,581 protein-coding genes were predicted in the genome, of which 99.00% were functionally annotated. Searches against the Aves BUSCO (Benchmarking Universal Single-Copy Orthologs) gene groups showed that 97.00% of BUSCO genes were complete, suggesting a high level of genome completeness. This genome provides a valuable resource for the conservation genomics of critically endangered species and may help efforts to recover their population sizes.

Methods

Ethics statement

All animal handling and experimental procedures were approved by the Qufu Normal University Biomedical Ethics Committee (approval number: 2022001).

Sample and sequencing

Baer’s pochard tissue for whole-genome sequencing was obtained from a dead individual that had strayed into a fishing net in Shandong (China). The muscle tissue that we collected was stored at −80 °C and used for genomic DNA extraction and sequencing. Nine additional transcriptomic samples (heart, kidney, lung, spleen, liver, craw, gallbladder, blood, and muscle) were collected from the same individual and stored at −80 °C until RNA was extracted for transcriptome sequencing. The integrity and quality of the extracted DNA were checked using agarose gel electrophoresis and a Qubit Fluorometer. Paired-end libraries of genomic DNA (gDNA) were prepared using Illumina TruSeq Nano DNA Library Prep kits. One library with an insert size of 350 bp was constructed and sequenced on the Illumina HiSeq platform for the genome survey and base-level correction. A total of 60.34 Gb (coverage of 49.69×) of 150-bp paired-end reads were generated. Purified DNA was then prepared for sequencing with the genomic sequencing kit SQK-LSK109 (Oxford Nanopore Technologies, Oxford, UK) following the provided protocol, and single-molecule real-time sequencing of long reads was conducted on the PromethION platform (ONT, Oxford, UK). Approximately 136.50 Gb of data were obtained (coverage of 112.42×). The Hi-C library was constructed using muscle tissue from the same Baer’s pochard individual and sequenced on the Illumina PE150 platform. A total of 125.64 Gb of 150-bp paired-end reads were obtained, which covered the genome ~103.48× (Table 1). Finally, RNA was extracted from the nine transcriptomic samples and used for library construction, and RNA-Seq reads were generated for genome annotation on the Illumina NovaSeq 6000 platform. A total of 67.93 Gb of 150-bp paired-end reads were obtained after adapter trimming and quality filtering (Table 2).

Table 1 Sequencing data for A. baeri genome assembly.
Table 2 Statistical analysis of transcriptome sequencing results of nine organs.
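
The coverage values quoted for each library appear to follow from dividing the sequencing yield by the k-mer-based genome size estimate (1,214.25 Mb). The short sketch below illustrates that arithmetic; the function and variable names are ours, and the choice of denominator is an assumption inferred from the reported numbers rather than something the text states.

```python
# Assumption: coverage = sequenced bases / k-mer-estimated genome size.
GENOME_SIZE_MB = 1214.25  # k-mer-based estimate reported in the text

# Sequencing yields in Gb, as reported for each library.
yields_gb = {
    "Illumina": 60.34,
    "Nanopore": 136.50,
    "Hi-C": 125.64,
}


def coverage(yield_gb: float, genome_size_mb: float) -> float:
    """Return fold coverage: sequenced bases divided by genome size."""
    return (yield_gb * 1000.0) / genome_size_mb


for library, gb in yields_gb.items():
    print(f"{library}: {coverage(gb, GENOME_SIZE_MB):.2f}x")
```

Under this assumption the computed values agree with the reported 49.69×, 112.42×, and ~103.48× to within about 0.01×.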

Genome assembly

We used a combination of Nanopore long reads, Illumina short reads, and chromatin conformation capture (Hi-C) to generate a chromosome-level reference genome. The genome size and heterozygosity of Baer’s pochard were estimated from Illumina short reads using the k-mer spectrum10. The genome size was estimated to be approximately 1,214.25 Mb, and the heterozygosity rate was 0.38% (Table 3). NextDenovo (https://github.com/Nextomics) was used to build the initial assembly from the Nanopore long reads. Because long reads have relatively low base quality, NextPolish11, which uses quality-controlled Illumina short reads, was employed to polish the assembled genome. These steps yielded a Baer’s pochard assembly with a total length of 1.14 Gb, which was broadly consistent with the k-mer-based estimate. It comprised 228 contigs with N50 = 29,098,202 bp, and the overall GC content was 41.94% (Table 4). We then used the 125.64 Gb of Hi-C sequencing data to build the chromosome-level assembly. We first used HiCUP12 to map and process the reads from the Hi-C library; the Hi-C-corrected contigs were then subjected to the ALLHiC pipeline13 for partitioning, orientation, and ordering. A total of 135 scaffolds were anchored to 35 chromosomes with lengths ranging from 1.77 Mb to 208.01 Mb, which covered 97.88% of the whole genome. Finally, we obtained the first high-quality chromosome-level Baer’s pochard assembly (1.14 Gb) with a scaffold N50 length of 85.75 Mb (Table 5 and Fig. 2). The genome size, scaffold N50 length, and GC content of Aythya baeri are similar to those of Aythya fuligula (RefSeq assembly accession: GCF_009819795.1), a member of the same genus, but its contig N50 length is much longer than that of Aythya fuligula (Table 6). This indicates that the Aythya baeri genome assembly is of high quality.

Table 3 K-mer frequency and genome size evaluation of A. baeri.
Table 4 The result of A. baeri genome assembly.
Table 5 Chromosome and reference genome corresponding chromosome statistical results.
Fig. 2
Heat map of Hi-C assembly of the Baer’s pochard.

Table 6 Comparative analysis of the genome of A. baeri and A. fuligula.
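
Because contig and scaffold N50 are the main contiguity metrics reported here, a minimal sketch of how N50 is conventionally computed may be useful. It follows the standard definition (the length L such that sequences of length at least L make up half of the assembly); the toy lengths are hypothetical and are not taken from the A. baeri assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical toy assembly (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))
```

Applied to the full list of contig or scaffold lengths, the same function would give the values summarized in Tables 4 and 6.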

We used the Core Eukaryotic Genes Mapping Approach (CEGMA v2.5)14 and Benchmarking Universal Single-Copy Orthologs (BUSCO v4.1.2)15 to evaluate the completeness of the genome assembly. The single-copy ortholog set of the avian lineage (aves_odb10) was searched against the assembled Baer’s pochard genome using BUSCO; of its 8,338 single-copy orthologs, approximately 97.00% were present in this assembly (Table 7). For CEGMA, the core gene library consisted of 248 conserved genes from six eukaryotic model organisms, of which 95.97% were successfully assembled (Table 8).

Table 7 BUSCO analysis result of A. baeri genome.
Table 8 Statistical evaluation of genomic integrity by CEGMA.

Annotation of genomic repeat sequences

We annotated repeat sequences across the Baer’s pochard genome using both homology-based alignment and de novo prediction. RepeatModeler (v1.0.8)16, RepeatScout (v1.0.5)17 and LTR_FINDER (v1.0.7)18 were used to build a de novo repetitive element database. Tandem repeats were identified ab initio using TRF19. Homology-based prediction was performed against the Repbase database20, using RepeatMasker (v4.0.5)21 to extract repeat regions (Table 9). These analyses revealed approximately 157 Mb of repeat sequences, which accounted for 13.72% of the whole genome; thus, the repeat content of the A. baeri genome is slightly higher than that of the A. fuligula genome (13.00%). Among the repeat elements, long interspersed nuclear elements (LINEs) account for 8.80% of the genome, short interspersed nuclear elements (SINEs) for 0.01%, long terminal repeats (LTRs) for 4.13%, and DNA transposons for 0.15% (Table 10).

Table 9 Annotation of repeated sequences.
Table 10 Repetitive elements and their proportions in A. baeri genome.
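
When repeat annotations from several tools are combined, the same bases are often reported more than once, so a total repeat fraction is usually computed on merged intervals. The sketch below shows one way to do this, assuming the annotations have already been parsed into (sequence, start, end) tuples with 0-based, half-open coordinates; the function names and toy intervals are hypothetical, and this is not necessarily how the figures above were produced.

```python
from collections import defaultdict


def merged_repeat_bases(intervals):
    """Count bases covered by at least one repeat annotation.

    `intervals` is an iterable of (sequence_name, start, end) tuples in
    0-based, half-open coordinates. Overlapping or adjacent intervals on
    the same sequence are merged so that no base is counted twice.
    """
    by_seq = defaultdict(list)
    for seq, start, end in intervals:
        by_seq[seq].append((start, end))

    covered = 0
    for spans in by_seq.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlapping or touching: extend the current merged span.
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start
                cur_start, cur_end = start, end
        covered += cur_end - cur_start
    return covered


def repeat_fraction(intervals, genome_length):
    """Return the fraction of the genome covered by merged repeats."""
    return merged_repeat_bases(intervals) / genome_length


# Hypothetical toy annotations from two tools on a 1,000-bp "genome".
toy = [("chr1", 0, 100), ("chr1", 50, 150), ("chr1", 200, 250), ("chr2", 10, 20)]
print(f"{100 * repeat_fraction(toy, 1000):.2f}% repeats")
```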

Annotation of gene structure

We combined three approaches to predict protein-coding genes: homology-based comparison, ab initio prediction, and RNA-Seq-assisted prediction. For homology-based comparison, the reference protein sequences of five bird species, namely the tufted duck (Aythya fuligula), mallard (Anas platyrhynchos), mute swan (Cygnus olor), red junglefowl (Gallus gallus), and ruddy duck (Oxyura jamaicensis), were sourced from the Ensembl database (release 91) and aligned to the Baer’s pochard genome using TBlastN (v2.2.26; E-value ≤ 1e-5)22. The potential gene structures were predicted using Genewise (v2.4.1)23. For ab initio gene prediction, we used Augustus (v3.2.3)24, Geneid (v1.4)25, Genscan (v1.0)26, GlimmerHMM (v3.04)27 and SNAP28 with appropriate parameters. To optimize the genome annotation, RNA-Seq reads from nine different tissues were assembled de novo using Trinity (v2.1.1)29, and TopHat (v2.0.11)30 was used to align the RNA-Seq reads to the Baer’s pochard genome sequences. Cufflinks was then employed to determine potential gene structures. We used EvidenceModeler (EVM, v1.1.1) and PASA (Program to Assemble Spliced Alignments) to integrate the results of the three approaches into a non-redundant reference gene set31 composed of 18,581 genes, with an average CDS length of 1,600.42 bp and average exon and intron lengths of 169.04 bp and 2,763.57 bp, respectively (Table 11).

Table 11 Prediction of protein-coding genes.
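
Summary statistics such as average CDS, exon, and intron lengths can be derived directly from the exon coordinates of each gene model. The following sketch shows the calculation under two simplifying assumptions that are ours: each gene is represented by a single transcript, and every listed exon is entirely coding. The data structure and toy coordinates are hypothetical.

```python
def gene_structure_stats(genes):
    """Compute mean CDS, exon, and intron lengths.

    `genes` is a list of gene models; each model is a list of
    (start, end) coding exons in 0-based, half-open coordinates.
    """
    total_cds = 0
    exon_count = 0
    intron_total = 0
    intron_count = 0
    for exons in genes:
        exons = sorted(exons)
        total_cds += sum(end - start for start, end in exons)
        exon_count += len(exons)
        # An intron is the gap between two consecutive exons.
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
            intron_total += next_start - prev_end
            intron_count += 1
    return {
        "mean_cds_length": total_cds / len(genes),
        "mean_exon_length": total_cds / exon_count,
        "mean_intron_length": intron_total / intron_count if intron_count else 0.0,
    }


# Hypothetical toy gene set: one two-exon gene and one single-exon gene.
toy_genes = [[(0, 100), (300, 500)], [(1000, 1150)]]
print(gene_structure_stats(toy_genes))
```

As a rough consistency check on the reported values, an average CDS of 1,600.42 bp and an average exon of 169.04 bp imply roughly nine to ten coding exons per gene.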

We also predicted 432 tRNAs using tRNAscan-SE32. We identified 664 ncRNAs, comprising 342 miRNAs and 322 snRNAs, by searching against the Rfam database with Infernal33 using default parameters. Because rRNAs are highly conserved, we used rRNA sequences from related species as references and predicted 161 rRNA sequences using Blast34 (Table 12).

Table 12 Annotation of non-coding RNA genes.

Functional annotation of protein-coding genes

We functionally annotated the predicted proteins in the Baer’s pochard genome by homology searches against six databases: SwissProt35, InterPro36, Pfam37, Kyoto Encyclopedia of Genes and Genomes (KEGG)38, Gene Ontology (GO)39, and Nr (http://www.ncbi.nlm.nih.gov/protein). Respectively, 82.39%, 98.90%, 76.00%, 77.40%, 91.90%, and 85.30% of genes matched entries in these databases (Fig. 3). In total, 18,401 genes (99.00%) were successfully annotated with gene functions and conserved protein motifs (Table 13).

Fig. 3
Functional annotation statistics. Venn diagram illustrating the distribution of high-score matches of the functional annotation in the Baer’s pochard genome against six public databases.

Table 13 Functional annotation of the predicted protein-coding genes.
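
The overall annotation rate counts a gene as annotated if it has a match in at least one database, which is why it can exceed every per-database percentage. A minimal sketch of that union calculation is given below; the function name, gene identifiers, and toy hit sets are hypothetical placeholders, not the actual annotation results.

```python
def annotation_summary(hits_by_db, all_genes):
    """Return per-database and overall annotation percentages.

    `hits_by_db` maps a database name to the set of gene IDs with a match.
    A gene counts as annotated if it matches at least one database.
    """
    all_genes = set(all_genes)
    if not all_genes:
        raise ValueError("all_genes must be non-empty")
    per_db = {
        db: 100.0 * len(hits & all_genes) / len(all_genes)
        for db, hits in hits_by_db.items()
    }
    annotated = set().union(*hits_by_db.values()) & all_genes
    return per_db, 100.0 * len(annotated) / len(all_genes)


# Hypothetical toy input: five genes and three of the six databases.
toy_gene_ids = ["g1", "g2", "g3", "g4", "g5"]
toy_hits = {
    "SwissProt": {"g1", "g2"},
    "Pfam": {"g2", "g3"},
    "KEGG": {"g1", "g4"},
}
per_db, overall = annotation_summary(toy_hits, toy_gene_ids)
print(per_db, overall)
```

With the reported counts, 18,401 annotated genes out of 18,581 predicted genes corresponds to about 99.0%.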

Synteny analysis using the tufted duck genome

We conducted a whole-genome synteny analysis between the tufted duck (GCA_009819795.1) and Baer’s pochard genomes using MUMmer40. The whole-genome alignment between the two genomes was visualized using RectChr (BGI-shenzhen/RectChr), as shown in Fig. 4. The results showed high overall consistency between the tufted duck and Baer’s pochard genomes.

Fig. 4
Circos plot of the synteny analysis between the tufted duck and the Baer’s pochard genome.

Data Records

The Nanopore, Illumina, and Hi-C sequencing data used for genome assembly were deposited in the NCBI Sequence Read Archive under accession numbers SRR17568785 (ref. 41), SRR17518553 (ref. 42), and SRR17509905 (ref. 43). The transcriptomic sequencing data were deposited under accession numbers SRR17433182 (ref. 44) and SRR17497023-SRR17497030 (ref. 45). The assembled genome was deposited in the NCBI Assembly database under accession number JAKRSJ000000000 (ref. 46). The annotation results for repeat sequences, gene structure, and functional prediction were deposited in the Figshare database47.

Technical Validation

The integrity of the extracted DNA was checked by agarose gel electrophoresis, and the main band was approximately 45 kb long. The DNA concentration was determined using a Qubit fluorometer (Thermo Fisher Scientific, USA), and the 260/280 absorbance ratio was approximately 1.80.

We evaluated the completeness of the genome assembly using a sequence identity approach: reads from the small-fragment library were aligned to the assembled genome using BWA (http://bio-bwa.sourceforge.net/). Approximately 99.71% of the small-fragment reads aligned to the genome, and the coverage rate was approximately 99.45%, indicating consistency between the reads and the assembled genome.

SNPs were called using Samtools (v0.1.19), which identified 3,162,696 SNPs: 3,157,033 heterozygous and 5,663 homozygous. The homozygous SNPs corresponded to only 0.000502% of the genome, indicating the high base-level accuracy of this assembly.
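
The SNP figures above can be cross-checked with simple arithmetic. The sketch below assumes that the reported homozygous percentage is taken relative to the assembly length, and it uses the rounded 1.14 Gb value as the denominator; both points are our assumptions, so the result is expected to match the reported 0.000502% only approximately.

```python
# SNP counts as reported in the text.
HET_SNPS = 3_157_033
HOM_SNPS = 5_663
TOTAL_SNPS = 3_162_696

# Assumption: rounded assembly length; the exact denominator used
# for the reported percentage is not stated.
ASSEMBLY_BP = 1.14e9


def rate_percent(count, denominator):
    """Return count / denominator expressed as a percentage."""
    return 100.0 * count / denominator


# The two SNP classes should add up to the reported total.
print(HET_SNPS + HOM_SNPS == TOTAL_SNPS)

# Homozygous SNPs per assembled base (proxy for base-level error rate).
print(f"{rate_percent(HOM_SNPS, ASSEMBLY_BP):.6f}% of bases")

# Heterozygous SNPs per assembled base (proxy for heterozygosity).
print(f"{rate_percent(HET_SNPS, ASSEMBLY_BP):.3f}% of bases")
```

Homozygous differences between an individual's own reads and its assembly are usually interpreted as likely assembly errors, which is why a very low homozygous rate supports base-level accuracy.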