Abstract
Przevalski’s partridge (Alectoris magna) is one of the birds in the genus Alectoris and is endemic to China. The distribution of A. magna is narrow, and it is found only in parts of the Qinghai, Gansu, and Ningxia provinces. A. magna was considered a monotypic species until it was divided into two subspecies. However, external morphological characteristics, rather than genetic differences or evolutionary relationships, are now commonly used as evidence of subspecies differentiation. In this study, a chromosome-level reference genome of A. magna was constructed by combining Illumina, PacBio, and Hi-C sequencing data. The final A. magna genome assembly was 1135.01 Mb in size. The genome showed 96.9% completeness (BUSCO), with a contig N50 length of 23.34 Mb. The contigs were clustered and oriented on 20 chromosomes, covering approximately 99.96% of the genome assembly. In total, 19,103 protein-coding genes were predicted, of which 95.10% were functionally annotated. This high-quality genome assembly could serve as a valuable genomic resource for future research on the functional genomics, genetic conservation, and interspecific hybridization of A. magna.
Background & Summary
Birds of the genus Alectoris are currently divided into seven species. Most of them are widely distributed across Eurasia, and their subspecies diverge widely. Specifically, they are distributed as far east as the northern coast of China, as far north as southern Russia, and as far south as the Arabian Peninsula and the Mediterranean islands1,2, and they were later introduced to Britain and the United States3,4.
Przevalski’s partridge (Alectoris magna), a member of the family Phasianidae, is one of the seven species in the genus Alectoris5 and is endemic to China. It is distributed only in the Qinghai, Gansu, and Ningxia provinces, so its distribution area is relatively narrow. To date, few studies have been conducted on A. magna. Large areas of land are presently being reclaimed for farmland in the already narrow distribution area of A. magna, while habitat conditions are deteriorating because of overhunting and the development of agriculture and animal husbandry6,7. In 2021, A. magna was listed at the second level of the Chinese List of National Key Protected Wildlife. The two subspecies of A. magna diverged about 500,000 years ago; they differ significantly in sequence variation, share no haplotypes, and lack gene flow. A completely assembled genome would help refine the reference criteria for subspecies differentiation. Previous studies have reported asymmetric introgression between the two partridges (Alectoris magna and Alectoris chukar), which makes it difficult to identify the species correctly based on morphology alone and also affects the genetic integrity of the existing species8,9,10. The resulting hybrids were morphologically similar to A. magna but had genotypes similar to those of A. chukar. It was speculated that genes of A. chukar might have flowed into the gene pool of A. magna, which would interfere with sampling and sequencing. Previously, the complete mitochondrial genome of the Helan Mountain chukar was determined, providing basic data for genetic research on this endangered species6. Whole-genome data and resources can now provide a foundation for subsequent research on the origin, subspecies division, population dynamics, and genetic conservation of A. magna.
In this study, a high-quality chromosome-level genome of Przevalski’s partridge was generated by integrating PacBio HiFi sequencing, Illumina paired-end sequencing, and high-throughput chromatin conformation capture (Hi-C) technology. The final A. magna assembly had a contig N50 length of 23.34 Mb. A total of 19,103 protein-coding genes were predicted, of which 95.10% were functionally annotated. The reference genome acquired in this study may serve as a valuable resource for future research on A. magna.
Methods
Sampling and sequencing
An adult specimen of A. magna from Lanzhou, China, was selected. Blood obtained through jugular vein sampling was used for DNA extraction, genome sequencing, and assembly. All blood samples were frozen fresh and stored in liquid nitrogen until they were used for DNA extraction. The use of the animal in this study was reviewed and approved by the Experimental Animal Welfare and Ethics Review Committee of Yantai University, Shandong, China.
Following the manufacturer’s protocols, whole genomic DNA was extracted using an E.Z.N.A.® Blood DNA kit (OMEGA, USA), and sequencing libraries were prepared using the TruSeq Nano DNA Sample Preparation Kit (Illumina, USA). The resulting libraries, with an insert size of 450 bp, were quantified using a TBS-380 Mini-Fluorometer with PicoGreen (Invitrogen) and sequenced on an Illumina NovaSeq 6000 platform to produce 150 bp paired-end reads. Illumina sequencing yielded 66 Gb of raw genomic data for A. magna (Table 1). The raw data were then quality-trimmed to remove low-quality data and make the subsequent assembly more accurate. The base distribution and per-cycle quality fluctuation of all sequencing reads were analyzed statistically; the resulting Illumina raw-data quality control chart directly reflects the sequencing quality of the sample and the quality of library construction.
After library construction, HiFi sequencing was performed on a PacBio Sequel II. After the raw data were processed through a series of filters, 34.2 Gb of reads with an average length of 14.2 kb passed quality control.
To perform chromosome-level genome assembly, a Hi-C library was constructed using the MboI restriction enzyme and a previously described standard protocol11,12. Briefly, after the samples were ground in liquid nitrogen, the cells were treated with formaldehyde to cross-link DNA with proteins. The cross-linked DNA was digested with the restriction enzyme to generate sticky ends. The ends were then repaired, and biotin was introduced to label the oligonucleotide ends, which were subsequently ligated with T4 DNA ligase. Protease digestion was used to reverse the cross-links, and the purified DNA was sheared into fragments 500–700 bp in length. The labeled DNA was captured using streptavidin magnetic beads. The Hi-C libraries were quantified and sequenced on an Illumina NovaSeq 6000, and the sequencing data were used for chromosome-level assembly13.
Genome size estimation and de novo assembly of A. magna
Before genome assembly, analysis, and annotation, we used k-mer statistics to estimate the genome size from the Illumina sequencing data. The k-mer size was set to 21 to estimate the genome size, heterozygosity, and repeat content of the sample14. Based on a total of 47,071,851,190 21-mers, the genome size was predicted to be 1095.8 Mb, and the estimated heterozygosity and repeat rate were approximately 0.86% and 19.2%, respectively (Table 2 and Fig. 1).
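As a rough illustration of the k-mer approach, genome size can be approximated by dividing the total number of k-mer occurrences by the depth at the main peak of the k-mer frequency histogram. The sketch below is a simplification and not the pipeline used in this study: the function name, the error cutoff, and the toy histogram are hypothetical, and the peak depth of about 43 is back-calculated here from the reported totals (47,071,851,190 21-mers and 1095.8 Mb) rather than taken from the paper.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer frequency histogram.

    histogram: dict mapping k-mer depth -> number of distinct k-mers at that depth.
    min_depth: depths below this are treated as sequencing errors (assumed cutoff).
    Returns (estimated_size, peak_depth).
    """
    # Keep only depths at or above the error cutoff.
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    # The depth with the most distinct k-mers approximates the sequencing depth.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences = sum over depths of (depth * distinct k-mers).
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram with hypothetical values (not the A. magna data).
toy_histogram = {1: 900, 2: 300, 9: 10, 10: 50, 11: 12}
size, peak = estimate_genome_size(toy_histogram)

# Depth implied by the figures reported in the text (a derived value).
implied_depth = 47_071_851_190 / 1095.8e6
```

Dedicated tools fit a statistical model to the histogram so that heterozygous and repeat peaks are handled properly; the single-peak division above only conveys the basic idea.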
PacBio HiFi long reads were preliminarily assembled using the HiFi assembler Hifiasm (https://github.com/chhylp123/hifiasm). Although HiFi reads are highly accurate, some errors remain. Hifiasm loads all HiFi reads into memory for all-vs-all alignment and error correction. Based on the overlaps between reads, a base on a read that differs from the other reads and is supported by at least three reads is considered a single nucleotide polymorphism (SNP) and retained; otherwise, it is regarded as an error and corrected. Eventually, the long-read SMRTbell library15 yielded a genome assembly of 1135.01 Mb with a contig N50 of 23.34 Mb, which is similar to the genome size predicted by the k-mer analysis.
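For readers unfamiliar with the contig N50 statistic reported above, the following minimal sketch shows how it is commonly computed: contigs are sorted from longest to shortest, and the N50 is the length of the contig at which the running total first reaches half of the assembly size. The function name and the toy contig lengths are hypothetical and are not taken from the A. magna assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    The N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy example with hypothetical contig lengths (total = 150, half = 75).
toy_contigs = [50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)  # 50 + 40 = 90 >= 75, so the N50 is 40
```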
Chromosome-level genome assembly and assessment of the genome assemblies
Hi-C-assisted genome assembly was performed using Hi-C scaffolding methods16. Contigs from the initial assembly were clustered and oriented to produce a chromosome-scale assembly. In total, 113.62 Gb of clean data were obtained from the Hi-C library (Table 1). Because cis interactions are more frequent than trans interactions, the Hi-C-corrected contigs could be clustered, oriented, and anchored using the ALLHiC pipeline17. In the final assembly, 1102.93 Mb (97.17%) of the genome sequences were anchored on 31 chromosomes, with chromosome lengths ranging from 0.49 Mb to 198.20 Mb (Fig. 2 and Table 3). Additionally, the Hi-C interaction heat map of the assembly was consistent with a high-quality genome assembly (Fig. 3).
GC-depth analysis was used to evaluate the assembly and determine whether there was significant GC bias or sample contamination18. The reads were aligned to the assembled sequences, and both the GC content of the sequences and the coverage depth of the reads were measured19. A correlation analysis was then performed between GC content and sequencing depth (Fig. 4). In addition, the completeness of the assembly was assessed using Benchmarking Universal Single-Copy Orthologs (BUSCO v4.2.1)20,21 with the vertebrata_odb10 database and CEGMA22 software. The results showed that 96.9% (single-copy genes: 96.6%, duplicated genes: 0.3%) of the 8338 single-copy genes were identified as complete, 0.6% were fragmented, and 2.5% were missing from the assembled genome (Table 4). Merqury further estimated the completeness of the genome at 91.08%, with a consensus quality value (QV) of 64.2452 and an error rate of 3.76251e-07. In summary, these assessments indicate that the A. magna genome assembly is of high quality.
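The QV and error rate reported above are two views of the same quantity: a Phred-scaled quality value is defined as QV = −10·log10(error rate). The short sketch below, with hypothetical function names, converts between the two and shows that the reported error rate of 3.76251e-07 corresponds to a QV of about 64.245.

```python
import math


def qv_from_error_rate(error_rate):
    """Convert a per-base error rate to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


def error_rate_from_qv(qv):
    """Convert a Phred-scaled quality value back to a per-base error rate."""
    return 10 ** (-qv / 10)


# Values reported in the text for the A. magna assembly.
reported_error_rate = 3.76251e-07
qv = qv_from_error_rate(reported_error_rate)
round_trip = error_rate_from_qv(qv)
```

On this scale, a QV of 60 means roughly one error per million bases, so a value above 64 indicates fewer than one expected error per two million bases.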
Repetitive and non-coding gene prediction
Before predicting and annotating the protein-coding genes, repetitive elements in the A. magna genome were identified through a combination of homology-based comparison and ab initio prediction. RepeatMasker (https://www.repeatmasker.org/) and Tandem Repeats Finder (https://tandem.bu.edu/trf/trf.html) were used to identify interspersed repeats and tandem repeats, respectively. With RepeatMasker23,24, interspersed repeats were found by aligning the sequence against a database of known repeats (RepBase)25,26. Ultimately, we identified 361.2 Mb of repetitive sequences, comprising 229.1 Mb of interspersed repeats and 132.1 Mb of tandem repeats and accounting for 31.8% of the assembled genome. Among the classified interspersed repeats, long interspersed nuclear elements (LINEs) were the most abundant, with a total length of 82 Mb, whereas rolling-circle (RC) elements were the rarest, with a total length of 0.67 Mb, or 0.06% of the whole genome sequence (Table 5).
The positions and secondary structures of the tRNAs were predicted using tRNAscan-SE v2.0.727, and BLAST was used to predict the rRNA sequences. A total of 283 tRNAs were predicted using tRNAscan-SE, and 99 rRNA genes were annotated using BLASTN28. The other three types of ncRNA (sRNA, snRNA, and miRNA) were predicted in a similar way: the sequences were first compared against the Rfam database29 for annotation, and the cmsearch program with default parameters was then used to determine the final sRNAs, snRNAs, and miRNAs.
Protein-coding genes prediction and annotation
The protein-coding genes in the A. magna genome assembly were predicted using a combination of de novo prediction, homologous protein alignment, and transcriptome-based methods. Augustus v3.2330 was used for de novo prediction. For homology-based prediction, we downloaded the protein sequences of Coturnix japonica (GCF_001577835.2) from the NCBI database and used TblastN v2.2.26 with an e-value of 1e−5 to align the protein sequences to the sample genome31. Then, to obtain accurate spliced alignments, the matching proteins were aligned to the homologous genome sequences using GeneWise v2.4.132, which was subsequently used to identify the coding and intron regions of the genes. For RNA-Seq-based prediction, RNA sequencing data derived from blood samples were aligned to the A. magna genome FASTA file using TopHat v2.1.1 with default parameters33,34, and the alignment results served as input for Cufflinks v2.2.1 to predict the gene structure35,36,37. Transcriptome data were assembled with Trinity v2.11.0 to obtain transcripts38. Subsequently, EvidenceModeler v1.1.1 was used to integrate these gene sets and obtain the coding genes of the sample genome39. As a result, 19,103 protein-coding genes were predicted, with a mean coding sequence (CDS) length of 1561 bp.
The protein sequences of the predicted genes were compared against public functional databases, including the Nr, SwissProt40,41, GO42, eggNOG, and KEGG databases43,44, using blastp (BLAST+ 2.7.1; threshold: e-value no more than 1e−5)37 for functional annotation. In total, 18,167 genes were successfully annotated using at least one public database, representing 95.1% of all predicted genes (Table 6 and Fig. 5).
Data Records
The sequencing data (Illumina genomic sequencing reads, PacBio long reads, Hi-C data, and RNA-seq reads) were deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under accessions SRR2387579045, SRR2387578946, SRR2387578847, and SRR2572216448. The genome assembly was deposited at DDBJ/ENA/GenBank under the accession JARUNP00000000049. The genome assembly, repeat sequence predictions, and functional annotation results are available at Figshare50.
Technical Validation
Data filtering and quality control
FastQC v0.11.8 was used to assess the quality of the raw sequencing data. The raw data contained low-quality reads, reads with high N content, and adapter contamination. To improve the accuracy of the subsequent assembly, Trimmomatic v0.3951 was used to eliminate these; the specific steps included removing adapter sequences from reads, trimming read ends with low sequencing quality (quality value less than 20), and removing reads containing more than 10% N bases. The resulting clean reads were stored in FASTQ format.
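To make the filtering criteria concrete, the sketch below applies two of the rules described above to a single read: it trims bases with a quality value below 20 from both ends and discards the read if more than 10% of the remaining bases are N. This is an illustrative stand-in and not Trimmomatic itself; the function name, the Phred+33 quality encoding, and the example reads are assumptions, and adapter removal is omitted.

```python
def trim_and_filter(seq, qual, min_q=20, max_n_frac=0.10, offset=33):
    """Trim low-quality read ends, then drop reads with too many N bases.

    seq: read bases; qual: FASTQ quality string (Phred+33 assumed).
    Returns the trimmed (seq, qual) pair, or None if the read is discarded.
    """
    scores = [ord(c) - offset for c in qual]
    start, end = 0, len(seq)
    # Trim bases below min_q from the 5' end.
    while start < end and scores[start] < min_q:
        start += 1
    # Trim bases below min_q from the 3' end.
    while end > start and scores[end - 1] < min_q:
        end -= 1
    trimmed_seq, trimmed_qual = seq[start:end], qual[start:end]
    if not trimmed_seq:
        return None  # nothing left after trimming
    n_frac = trimmed_seq.upper().count("N") / len(trimmed_seq)
    if n_frac > max_n_frac:
        return None  # too many ambiguous bases
    return trimmed_seq, trimmed_qual


# Hypothetical reads: '#' encodes Q2 and 'I' encodes Q40 in Phred+33.
kept = trim_and_filter("ACGTACGTAC", "#IIIIIIII#")
dropped = trim_and_filter("ACGTNACGTA", "##IIIIII##")
```

In the first read only the low-quality first and last bases are removed; in the second, one N among the six bases that survive trimming exceeds the 10% limit, so the read is discarded.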
Assembly validation
To ensure the accuracy and continuity of the genome for subsequent annotation and comparative genomic analysis, the integrity of the assembly must be evaluated after its completion. Three quality assessments were used to evaluate the genome assembly comprehensively: GC-depth analysis (sequencing depth/coverage and GC distribution), Merqury, and BUSCO. The GC content distribution and sequencing coverage of the assembled sequences were determined from a GC-depth distribution map. Merqury evaluates the genome based on k-mers to obtain the consensus quality value (QV), assembly error rate, and completeness. BUSCO compares conserved orthologous genes against the genome assembly to assess the completeness of its gene regions, especially conserved gene regions.
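A GC-depth distribution map is typically built by cutting the assembly into windows and plotting the GC fraction of each window against its mean read depth; a single tight cluster suggests no strong GC bias or contamination. The sketch below computes those per-window points. It is a minimal illustration under stated assumptions: the function name, the window size, and the toy sequence and depth values are hypothetical, and per-base depths are assumed to be available already from a read alignment.

```python
def gc_depth_windows(sequence, depths, window=10):
    """Return (gc_fraction, mean_depth) for each non-overlapping window.

    sequence: assembled sequence as a string.
    depths: per-base read depth, same length as sequence.
    """
    if len(sequence) != len(depths):
        raise ValueError("sequence and depths must have the same length")
    points = []
    for i in range(0, len(sequence), window):
        chunk = sequence[i:i + window].upper()
        cov = depths[i:i + window]
        gc = (chunk.count("G") + chunk.count("C")) / len(chunk)
        points.append((gc, sum(cov) / len(cov)))
    return points


# Toy data (hypothetical): a 20-base sequence with two depth levels.
toy_seq = "GGCCAATTAAGCGCGCGCGC"
toy_depths = [30] * 10 + [20] * 10
points = gc_depth_windows(toy_seq, toy_depths)
```

Real analyses use windows of thousands of bases and plot the resulting points as a scatter or density map.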
Code availability
If no detailed parameters were mentioned, all software and tools in this study were used with their default parameters. No specific code or script was used in the study.
References
John, R. “The Clements Checklist of Birds of the World 6th Edition” by James F. Clements. 2007. Can Field Nat 120, 483 (2006).
Carroll, J. P. Pheasants, Partridges, and Grouse: A Guide to the Pheasants, Partridges, Quails, Grouse, Guineafowl, Buttonquails, and Sandgrouse of the World. Forest Sci, 4 (2002).
Khan, H. A., Arif, I. A. & Shobrak, M. DNA Barcodes of Arabian Partridge and Philby’s Rock Partridge: Implications for Phylogeny and Species Identification. Evol Bioinform 6, EBO.S6014 (2010).
Belik, V. P. Faunogenetic structure of the Palearctic avifauna. Entomol Rev 86, S15–S31 (2006).
Randi, E. A Mitochondrial Cytochrome B Phylogeny of the Alectoris Partridges. Mol Phylogenet Evol 6, 214–227 (1996).
Gao, H. et al. The complete mitochondrial genome of Helan Mountain chukar Alectoris chukar potanini (Galliformes: Phasianidae). Mitochondrial DNA B 4, 2443–2444 (2019).
Palmer, W. E. & Carroll, J. P. Pheasants, Partridges, and Grouse: A Guide to the Pheasants, Partridges, Quails, Grouse, Guineafowl, Buttonquails, and Sandgrouse of the World. The Auk 120, 920–921 (2003).
Chen, Y. K., An, B. & Liu, N. F. Asymmetrical introgression patterns between rusty-necklaced partridge (Alectoris magna) and chukar partridge (Alectoris chukar) in China. Integr Zool 11, 403–412 (2016).
Ouchia-Benissad, S. & Ladjali-Mohammedi, K. Banding cytogenetics of the Barbary partridge Alectoris barbara and the Chukar partridge Alectoris chukar (Phasianidae): a large conservation with Domestic fowl Gallus domesticus revealed by high resolution chromosomes. Comp Cytogenet 12, 171–199 (2018).
Barbanera, F. et al. Sequenced RAPD markers to detect hybridization in the barbary partridge (Alectoris barbara, Phasianidae). Mol Ecol Resour 11, 180–184 (2011).
Belton, J.-M. et al. Hi–C: A comprehensive technique to capture the conformation of genomes. Methods 58, 268–276 (2012).
van Berkum, N. L. et al. Hi-C: A Method to Study the Three-dimensional Architecture of Genomes. Jove-J of Vis Exp, e1869 (2010).
Burton, J. N. et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat Biotechnol 31, 1119–1125 (2013).
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).
Korlach, J. et al. Real-Time DNA Sequencing from Single Polymerase Molecules. Methods Enzymol 472, 431–455 (2010).
Wingett, S. et al. HiCUP: pipeline for mapping and processing Hi-C data. F1000Research 4, 1310 (2015).
Zhang, X., Zhang, S., Zhao, Q., Ming, R. & Tang, H. Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nat Plants 5, 833–845 (2019).
Benjamini, Y. & Speed, T. P. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res 40, e72–e72 (2012).
Risso, D., Schwartz, K., Sherlock, G. & Dudoit, S. GC-Content Normalization for RNA-Seq Data. BMC Bioinformatics 12, 480 (2011).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Manni, M., Berkeley, M. R., Seppey, M. & Zdobnov, E. M. BUSCO: Assessing Genomic Data Quality and Beyond. Curr Protoc 1, e323 (2021).
Parra, G., Bradnam, K. & Korf, I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061–1067 (2007).
Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences. Curr Protoc Bioinformatics 25, 4.10.11–14.10.14 (2009).
Chen, N. Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences. Curr Protoc Bioinformatics 5, 4.10.11–14.10.14 (2004).
Kohany, O., Gentles, A. J., Hankus, L. & Jurka, J. Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor. BMC Bioinformatics 7, 474 (2006).
Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA 6, 11 (2015).
Lowe, T. M. & Eddy, S. R. tRNAscan-SE: A Program for Improved Detection of Transfer RNA Genes in Genomic Sequence. Nucleic Acids Res 25, 955–964 (1997).
Kent, W. J. BLAT–the BLAST-like alignment tool. Genome Res 12, 656–664 (2002).
Griffiths-Jones, S. et al. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 33, D121–D124 (2005).
Stanke, M. & Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19, ii215–ii225 (2003).
Gertz, E. M., Yu, Y.-K., Agarwala, R., Schäffer, A. A. & Altschul, S. F. Composition-based statistics and translated nucleotide searches: Improving the TBLASTN module of BLAST. BMC Biology 4, 41 (2006).
Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res 14, 988–995 (2004).
Trapnell, C., Pachter, L. & Salzberg, S. L. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105–1111 (2009).
Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14, R36 (2013).
Ghosh, S. & Chan, C.-K. K. Analysis of RNA-Seq Data Using TopHat and Cufflinks. Methods Mol Biol 1374, 339–361 (2016).
Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res 34, W435–W439 (2006).
Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7, 562–578 (2012).
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29, 644–652 (2011).
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol 9, R7 (2008).
Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res 28, 45–48 (2000).
The UniProt Consortium UniProt: the universal protein knowledgebase. Nucleic Acids Res 45, D158–D169 (2017).
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat Genet 25, 25–29 (2000).
Kanehisa, M. et al. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res 42, D199–D205 (2014).
Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 28, 27–30 (2000).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRR23875790 (2023).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRR23875789 (2023).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRR23875788 (2023).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRR25722164 (2023).
Xia, W. H. Alectoris magna, whole genome shotgun sequencing project. GenBank https://identifiers.org/ncbi/insdc:JARUNP000000000 (2023).
Xia, W. H. Whole genome sequencing of Przevalski’s partridge (Alectoris magna). Figshare https://doi.org/10.6084/m9.figshare.22558330 (2023).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
Acknowledgements
The study was funded by Research and Development Program of Shandong Province, China, (Major Science and Technology Innovation Project) under No. 2021CXGC011306; The Key Funded with the MNR Key Laboratory of Eco-Environmental Science and Technology, China under No. MEEST-2021-05; Natural Science Foundation Shandong Province under No. ZR2020MD002; The Doctoral Science Research Foundation of Yantai University under SM15B01, SM19B70 and SM19B28; “double-hundred action” of Yantai under No. 2320004-SM20RC02.
Author information
Contributions
Wang X.M. and Xia W.H. designed the experiments and wrote the manuscript; Teng X.D. and Lin W.Y. collected the samples; Xing Z.K. and Wang S. extracted the genomic DNA; Wang X.M., Xia W.H., Liu X.M., and Qu J.Y. performed the data analysis; Zhao W. and Wang L.J. conceived the idea, supervised the work, and revised the manuscript. All authors have read and approved the final manuscript for submission.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, X., Xia, W., Teng, X. et al. Chromosome-level genome assembly of Przevalski’s partridge (Alectoris magna). Sci Data 10, 829 (2023). https://doi.org/10.1038/s41597-023-02655-5
DOI: https://doi.org/10.1038/s41597-023-02655-5