Background & Summary

Greater scaup (Aythya marila), the diving duck of the family Anatidae, which are distributed across Eurasia and North America. A. marila is one of the few Anatidae that breed in the Arctic, and the only duck belonging to the Aythya. A. marila which is a typical winter migrant. In winter, they migrate to the south to overwinter in shallow waters near the coast, estuaries, inland lakes, reservoirs, etc1,2,3. Little attention has been paid to this species because of extremely large range and population size. In fact, the population of A. marila has been declining in the past decades due to lack of effective conservation and reduced food quality4,5.

In recent years, with the development of high-throughput sequencing technology, genomes of more birds have been sequenced, providing a rich genetic resource for the study of species evolutionary mechanisms6,7,8,9,10. However, the published data does not include this species. Studies at the genetic level in this species were extremely rare due to lack of attention, and other few relevant studies were also related to diet, populations, and avian viruses4,5,11.

In this study, we present the first genome assembly of A. marila based on Nanopore sequencing and Hi-C sequencing technology. We used Nanopore sequencing data, Illumina sequencing data, and Hi-C sequencing data to assemble a genome (1,135.19 Mb) with scaffold N50 of 85.44 Mb, and 98.28% of the assembled bases were associated with the 35 chromosomes. This genome will be a valuable resource for many studies, including convergent evolution, population genomics, adaptive evolution, and comparative genomics, among others.

Materials and methods

Ethics statement

All animal experimental procedures were approved by the Biomedical Ethics Committee of Qufu Normal University (approval number: 2022001).

Sampling and sequening

The experimental sample is a wounded male duck found during the wild bird survey in Jiangsu, China, which died unexpectedly during rescue. We dissected the sample and used the muscle for genome sequencing. Additionally, six transcriptomic samples (heart, kidney, liver, lung, pancreas, and spleen) from the same individual were collected and stored in an ultra-low temperature refrigerator at −80 °C, until samples were used for extracted genomic DNA and RNA.

Using Illumina® TruSeq® Nano DNA Library Prep kits to generate sequencing libraries of genomic DNA, we constructed a paired-end library with an insertion size of 350 bp, which was sequenced using Illumina HiSeq 4000 platform. Finally, a total of 60.77 Gb data (coverage of 45.30×) of 150 bp paired-end reads were generated. Simultaneously, using the PromethION platform with single-molecule real-time sequencing to sequence long reads. 122.55 Gb data were obtained, which covered 91.36 fold of the greater scaup’s genome. To assemble a chromosome-level genome of A. marila, a high-through chrome configuration capture (Hi-C) library was built and sequenced using Illumina PE150 platform. In total, we obtained 63.43 Gb raw reads. The six transcriptomic samples were used for RNA and sequenced using Illumina NovaSeq 6000 Platform, 43.9 Gb raw reads of all samples were obtained and used for subsequent gene prediction.

Genome size estimation and assembly

To estimate the genome size, heterozygosity and the proportion of repetitive sequences in A. marila, we performed K-mer analysis using Illumina sequencing data by jellyfish (v2.2.7). While K = 17, the estimated genome size was 1,341.4 Mb, the heterozygosity was 0.47%, and the proportion of repetitive sequences was 42.28% (Table 1). Then, we used NextDenovo (v2.4.0) ( software to assemble the genome with Oxford nanopore technologies (ONT) long reads, and NextPolish12 (v1.3.1) was used to increase the precision of single base with Illumina short reads. We obtained a preliminarily assembly with size of 1,135.19 Mb after de novo assembly and base correction, which included 194 contigs, contig N50 was 34.42 Mb, and the proportion of GC was 41.94%.

Table 1 Genomic characteristics statistics while K = 17.

ALLHiC13 (v0.9.8) and Juicebox (v1.11.08) software were used to mount the contigs in preliminarily assembly onto chromosomes. Contigs were sorted, pruned, and optimized based on the signal strength after Hi-C data were utilized to identify the interaction signals between Contigs. The interaction signals were then shown as a heat map using ALLHiC (Fig. 1). Using Juicebox to correct the position of contigs according to the intensity of chromosome interaction, and finally get a chromosome-level genome assembly with scaffold N50 of 85.44 Mb (Table 2). A total of 106 contigs were mapped to 35 chromosomes with lengths ranging from 2.11 Mb to 207.93 Mb, and 98.28% of the sequences were mapped to the chromosomes (Table 3).

Fig. 1
figure 1

Visual heat map of the Greater Scaup’s Hi-C assembly. (a) The global heat map of all the chromosomes. (b) Heat maps of each chromosome.

Table 2 Assembly statistics of A. marila.
Table 3 Summary of assembled 35 chromosomes of A. marila.

Assessment of the genome assemblies

To assess the accuracy of genome assembly, we used Burrows-Wheeler aligner14 (BWA) (v0.7.8) to map Illumina reads to the genome of A. marila. The reads matching rate was approximately 98.80%, indicating a good agreement between the reads and the assembled genome. Merqury15 (v1.3) was ran to evaluate assembly quality value (QV), and a high QV (42.14) indicates that this assembly is of good quality. Benchmarking Universal Single-Copy Orthologs16 (BUSCO) (v5.4.4) (use option “--augustus”) and Core Eukaryotic Genes Mapping Approach17 (CEGMA) (v2.5) were also used to assess the integrity of the assembled genome. The BUSCO results showed that 97.0% of the complete BUSCOs and 0.8% of the fragmented BUSCOs were found in 8338 single-copy orthologs of avers_odb10, and 2.2% of BUSCOs was missing (Fig. 2). Moreover, 238 of 248 core eukaryotic genes were detected using CEGMA. The above results indicate that the genome assembly of A. marila was complete and of high quality.

Fig. 2
figure 2

The BUSCO evaluation result of the Greater Scaup’s genome assembly.

Comparison with tufted duck genome assembly

Mummer18 (v4.0.0) was used to identify the synteny between A. marila and tufted duck19 (Aythya fuligula) genomes to determine orthologous chromosome pairs, and we used TBtools20 (v1.112) to draw the synteny between their chromosomes. The synteny analysis shows that the chromosome sequences of the two species have good synteny, and each chromosome corresponds to multiple chromosomes of another species. There is one sex chromosome Z (Chr4) in A. marila, and it’s multiple autosome have some synteny with the sex chromosome W of A. fuligula (Fig. 3). Most of the chromosomes can correspond to the chromosome of the tufted duck, however both Chr31 and Chr32 shared synteny with chromosome 34 of the tufted duck, and the synteny was poor (Fig. 3d). Simultaneously, they also have synteny with some contigs. This may be due to the poor assembly quality caused by too small chromosomes size (Chr31, 2.05 Mb; Chr32, 1.03 Mb). In gap analysis, the chromosomes of A. fuligula genome assembly had a higher proportion of N bases (Fig. 3c). In general, the genome we assembled was of high quality.

Fig. 3
figure 3

Chromosome-level synteny analysis of A. marila and A. fuligula, Amar means A. marila, Aful means A. fuligula. (a) Chromosomes length of A. marila and A. fuligula (Mb). (b) GC ratio (%) by 500 kb window. (c) Gap ratio (%) by 500 kb window. (d) The synteny plot of remove the non-orthologous links.

Annotation of repetitive sequences

We used a combined approach of de novo prediction and homology-based prediction to annotate repetitive sequences in the genome of A. marila. For de novo prediction, we used Tandem Repeats Finder21 (TRF) (v4.09) to detect tandem repeat sequences in the genome. RepeatModeler (v2.0.3), RepeatScout22 (v1.0.6) and RECON (v1.08) were used to build a database of transposable element (TE) family for de nove prediction. RepeatProteinMask and RepeatMasker (v4.1.2-p1) were used for homology prediction with Repbase database23 and Dfam database24, the species parameter used was chicken. Ultimately, we identified 154.94 Mb of repetitive sequences by all methods, accounting for 13.65% of the assembled genome (Table 4).

Table 4 Classification of repeat elements in A. marila genome.

Gene structure prediction

We used three methods to predict the protein-coding sequences in the genome of A. marila, ab initio prediction, homology-based prediction and RNA-Seq prediction. Three ab initio prediction software were used in the study, Augustus25 (v3.3.2), GlimmerHMM26 (v3.0.4), and Geneid27 (v1.4.5). For the six transcriptomic samples, the raw data were quality controlled and filtered out bad reads using fastp28 (v0.23.1) to obtain clean data. Then, the paired-end reads are assembled using SPAdes29 (v3.15.3). Next, the candidate coding regions in the transcript sequences are identified using TransDecoder30 (v5.5.0), and the sequences were clustered using CD-hit31 (v4.8.1) to remove redundant sequences. Eventually, the obtained protein sequences were used for subsequent prediction along with other downloaded sequences of related species. Genomes and annotation files of six related species19,32,33,34,35,36 (Anser cygnoides, Anas platyrhynchos, Aythya fuligula, Cygnus olor, Cygnus atratus, Gallus gallus) were downloaded from NCBI database to extract the longest transcript and translated them into protein sequences. Their protein sequences were matched to A. marila genome using Spaln37 (v2.4.6), and then the matched sequences were accurately spliced against the homologous protein-coding sequences using GeneWise38 (v2.4.1) software. Homology-based prediction results, RNA-Seq prediction results and ab initio prediction results are integrated using EvidenceModeler39 (EVM) (v1.1.1) to generate gene sets, and including masked repeats as input to the gene predictions. There were 19,257 genes after EVM integration. The BUSCO results showed that complete BUSCOs was 88.2%, fragmented BUSCOs was 1.5%, and missing BUSCOs was 10.3%, which was lower than the BUSCO results of genome (Fig. 4a). Thus, for some single-copy orthologs of aves_odb10, that ware complete in genome but missing or fragmented in predictive gene set, their annotation file were integrated into the gene set. Overall, a total of 19,533 protein-coding genes were predicted in A. marila genome, including 159 prematurely terminated genes and 273 partial genes. The average transcript length of this gene set was 23,126.79 bp, and the average coding sequence (CDS) length was 1,558.71 bp (Table 5). The BUSCO results showed 95.5% of the complete BUSCOs, 0.3% of the fragmented BUSCOs and 4.2% of the missing BUSCOs (Fig. 4b).

Fig. 4
figure 4

The evaluation result of BUSCO. (a) BUSCO result for 19,257 genes. (b) BUSCO result for 19,533 genes.

Table 5 The statistics of protein-coding genes annotated in A. marila genome.

Gene function annotation

Next, function annotation of predicted genes for A. marila using diamond40 against SwissProt41, TrEMBL42, Non-Redundant Protein Sequence Database (NR), Gene Ontology42 (GO), and Kyoto Encyclopedia of Genes and Genomes Orthology43 (KO) databases with e-value cutoff of 1e-5. The InterPro database44 is also used for function annotation by classifying proteins into families and predicting domains and important sites with Interproscan45 (v5.53). Finally, 15,789 genes were annotated, and accounted for 81.50% of all predicted genes of A. marila (Table 6).

Table 6 Annotation of predicted genes.

Filtering and verification of gene set

We filtered the predicted gene set, because the number of predicted genes was more than in the six related species, and more than 3,000 genes could not be function annotated. We used OrthoFinder to find the orthologs of A. marila and six related species. There were 4,086 unassigned genes, and 3,421 of them were not annotated in any database (Table 7). Most of these genes (3,417/3,421) were predicted using at least one of the de novo prediction software, and only four genes are supported by other evidence. The BUSCO test was performed after removed unassigned genes without function annotations, the results showed no effect compared with before removed (Table 8; Fig. 4b). These suggests that these genes may only have specific sequences and patterns, rather than real genes46. After filtering the unassigned genes without annotation and 159 prematurely terminated genes, 15,953 genes remain, including 182 partial genes, and 98.96% of them were annotated. The filtered gene set of A. marila was compared with the longest transcript set of related species, it has similar structure (Table 9 and Fig. 5). Except G. gallu, the average transcript length, average CDS length, average exons number, and average intron length of A. marila were smaller than those of others. Simultaneously, the average exon length of A. Marila was the longest among all species. However, these differences were not significant. This may be due to A. Marila has a higher proportion of short genes (Table 5a).

Table 7 Orthofinder Statistics Result of PerSpecies.
Table 8 Busco results of the filtered gene set.
Table 9 The comparison of the genetic structure from A. marila genome and related species.
Fig. 5
figure 5

Comparison graph of the prediction gene in A. marila and related species. (a) Gene length distribution and comparison with other species. (b) Exon length distribution and comparison with other species. (c) Exon number distribution and comparison with other species. (d) CDS length distribution and comparison with other species. (e) Intron length distribution and comparison with other species.

Data Records

All sequencing data had been uploaded to NCBI database

The genomic Illumina sequencing data, Nanopore sequencing data and Hi-C sequencing data were deposited at the Sequence Read Archive database of NCBI (accession numbers: SRR2167222547, SRR2167222348, and SRR2167222449).

The transcriptome data of 6 samples were deposited at the Sequence Read Archive database of NCBI (accession numbers: SRR21700073 - SRR2170007850,51,52,53,54,55).

The genome assembly of A. marila were deposited at GenBank database under the accession JAOQIG00000000056.

The annotation results of repeats sequences, gene structure and functional prediction were deposited at the Figshare database57.

Technical Validation

Quality control was performed on Illumina, Nanopore, Hi-C, and RNA sequencing data. The Q20 of Illumina sequencing data were greater than 96% and Q30 were greater than 90%. The Q7 of Nanopore sequencing data were greater than 92%. The Q20 of Hi-C sequencing data were greater than 94% and Q30 were greater than 96%. And the Q20 of RNA sequencing data were greater than 97%, and Q30 were greater than 93%. These evidences suggest that the data are reliable and can be used in subsequent analyses (Tables 10, 11).

Table 10 Statistical information of Illumina, Hi-C and RNA sequencing data.
Table 11 Statistical information of Nanopore sequencing data.