Draft genome assemblies of four manakins

Manakins are a family of small suboscine passerine birds characterized by their elaborate courtship displays, non-monogamous mating system, and sexual dimorphism. This family has served as a good model for the study of sexual selection. Here we present genome assemblies of four manakin species, including Cryptopipo holochlora, Dixiphia pipra (also known as Pseudopipra pipra), Machaeropterus deliciosus and Masius chrysopterus, generated by Single-tube Long Fragment Read (stLFR) technology. The assembled genome sizes ranged from 1.10 Gb to 1.19 Gb, with average scaffold N50 of 29 Mb and contig N50 of 169 Kb. On average, 12,055 protein-coding genes were annotated in the genomes, and 9.79% of the genomes were annotated as repetitive elements. We further identified 75 Mb of Z-linked sequences in manakins, containing 585 to 751 genes and an ~600 Kb pseudoautosomal region (PAR). One notable finding from these Z-linked sequences is that a possible Z-to-autosome/PAR reversal could have occurred in M. chrysopterus. These de novo genomes will contribute to a deeper understanding of evolutionary history and sexual selection in manakins.


Background & Summary
Manakins (Aves: Pipridae), a family of Passeriformes, contain 17 genera and about 50 species distributed across the Neotropics, and have some unique behavioral and morphological features 1 . Most species in the family have sexual dimorphism in plumage color 2 and are polygynous 3,4 . Moreover, the complex courtship displays of males, which include high-speed movements, sophisticated acrobatics, coordinated movements of multiple males, mechanical and vocal sounds and constructed display site construction 5,6 , makes this lineage a fascinating model for studying sexual selection. During mating periods, males hold territories or aggregate for competitive displays to attract females for the chance to mate 7 . Courtship varies substantially among genera and species [8][9][10][11] . For example, in genus Chiroxiphia, one male forms a partnership with another male and they perform elaborate courtship dances and sing common songs together 12 . In contrast, Corapipo gutturalis does not cooperate with other males during courtship displays 13 . Xenopipo atronitens males elaborate courtship displays by making mechanical sounds through flapping their wings 14 , whereas Lepidothrix coronata males sing to attract females in addition to acrobatic displays 15 .
Courtship behavior plays an important role in attracting the opposite sex, increasing the chance of producing offspring and improving the reproductive rate of birds 2,16 . At present, the courtship display of manakin species has been studied from the aspect of behavior observation 17,18 , neuroendocrine 14,19,20 and physiology 2 . The genetic mechanisms have also been discussed [21][22][23][24] , yet insights are lacking due to a lack of comparative genomic and transcriptomic data. As courtship displays are derived from sexual selection 25,26 , we expect that investigating the evolution of their genomes, particularly the sex chromosomes, could bring insights to the understanding of underlying genetic mechanisms. To address this knowledge gap, we conducted whole genome sequencing of four species representing four manakin genera: C. holochlora, D. pipra, M. deliciosus and M. chrysopterus 27,28 . Genome sizes of these four manakin species were estimated to be 1.15 Gb, the contig N50 ranged from 125 Kb to 212 Kb, and the scaffold N50 ranged from 18.4 Mb to 36.6 Mb. We annotated about 12,055 protein-coding genes on each manakin genome. On average, 99.97% of the predicted protein-coding genes were successfully annotated by three functional databases (SwissProt, InterPro, and KEGG). About 75 Mb of Z-linked sequences, including an ~600 Kb PAR, were identified from the available female manakin genomes, including two published species (Corapipo altera and Neopelma chrysocephalum). These genomic resources will benefit research on genetic mechanisms of manakin courtship displays, and other behavioral and ecological aspects.

Methods
Sample collection, library construction, and sequencing. Tissue samples of four manakin species (C. holochlora, D. pipra, M. deliciosus and M. chrysopterus) were provided by the Natural History Museum of Denmark. High-molecular-weight genomic DNA of these samples was extracted with the Kingfisher Cell and Tissue DNA Kit Protocol. Single tube-Long Fragment Read (stLFR) technology 29 was used to construct the libraries for each sample. The resulting libraries underwent DNA Nanoball (DNB ™ ) generation and DNBSEQ sequencing in 100 + 100 + 30 mode. On average, 149 Gb raw reads were produced for each species (Table 1).
Genome assembly and quality evaluation. A series of filtering steps was applied to these stLFR reads prior to the downstream analyses using SOAPfilter2 package (v2.2).
1. Remove reads with more than 10% of N bases; 2. Remove reads with more than 40% low quality bases (Phred score < = 10); 3. Remove reads with undersize insert size; 4. Filter out the PCR duplicates.
All cleaned stLFR library reads were transformed into 10X Genomics linked-reads format and passed into Supernova software (v2.0.1) 30 to assemble the genome under the "pseudohap" mode for each species. After removing scaffolds with "N" >80%, GapCloser (v1.12) 31 was used to close the intra-scaffold gaps.
The size of the four assembled genomes are about 1.15 Gb, similar to the sizes of other avian genomes 32 (Fig. 1a, Table 2). The scaffold N50 of all species is higher than 18 Mb, with the largest scaffold N50 found in M. chrysopterus (36 Mb). The contig N50 of all species is higher than 124 Kb. (Fig. 1b, Table 2).  33 to evaluate the completeness of these seven manakin genomes using aves_ odb10 as the reference gene set. On average 92% of the core genes were assembled as complete single-copy genes in the four manakin genomes and only about 3% of the core genes could not be annotated on the four manakin genomes (Fig. 1c, Table 2). Therefore, the overall quality of the newly assembled genomes was high and comparable to other published manakin assemblies.
Repeat annotation. Tandem repeats were identified by Tandem Repeat Finder (TRF, v4.09.1) 34 , and transposable elements (TEs) were annotated using a combination of homology-based RepeatMasker (v4.1.2) 35 , and de novo methods with RepeatModeler (v2.0.2a) 36 and LTR_Finder(v1.07) 37 . The homology-based annotation of TEs was performed by RepeatMasker with its built-in library. RepeatModeler and LTR_Finder methods were used to build the de novo repeat library for each species, which was further used by RepeatMasker to predict repeats for each species.
We found that the four species contained an average of 9.79% TEs in the genomes, with the proportions of each type being similar across these species (Fig. 2, Table 3). Long Interspersed Nuclear Elements (LINEs) accounted for most TEs, occupying about 6.79% of the genome.
Protein-coding gene annotation. We applied the homolog-based approach to annotate the protein-coding genes by using the protein sequences of Gallus gallus, Taeniopygia guttata and Homo sapiens downloaded from Ensembl release 105 as the reference gene sets. The protein sequences of these reference genes were aligned to each genome using TBLASTN (v2.2.26) 38 with an e-value cut off 1e-5, and multiple adjacent hits of the same query were connected by genBlastA (v1.0.4) 39 . Homologous blocks with length greater than 30% of the query protein length were retained. The connected hit region was later extended to include its 2 Kb upstream and downstream flanking regions, on which gene structure was predicted by Genewise (v2.4.1) 40 . MUSCLE (v3.8.31) 41 was then used to align the annotated protein with the reference protein. Predicted proteins with length ≥30 amino acids and identity value ≥40% were retained. Pseudogenes (annotated genes containing >2 frame shifts or >1 premature stop codon) and retrogenes were further removed. www.nature.com/scientificdata www.nature.com/scientificdata/ To build a non-redundant gene set, we first used hierarchical clustering 42 to combine the homologous-based gene sets of G. gallus and T. guttata. The gene model with the highest identity to the query was preserved if a locus has been annotated with more than one gene model. By doing so, we obtained 8,250 protein-coding genes on average after removing the highly duplicated genes (genes had >10 duplicates, were single-exon genes, and overlapped with the repeats in >70% of coding region). In the end, the newly annotated loci from the human gene set, i.e., the gene model did not overlap with the above combined one, were added into the results. In   www.nature.com/scientificdata www.nature.com/scientificdata/ summary, we predict an average of 12,055 protein-coding genes for each manakin with an average gene length of 22,952 bp. (Table 4).
Gene function annotation. The translated gene coding sequences were aligned to the SwissProt database (release-2020_05) 43 using BLASTP (v2.2.26) 38 with e-value cutoff 1e-5. The best match was assigned as the function annotation for each gene. Motifs and domains of each gene was annotated with modules PRINTS, SMART, PANTHER, ProSiteProfiles, ProSitePatterns, CDD, SFLD, Gene3D, SUPERFAMILY, and TMHMM of InterPro (v5.52-86.0) 44 . To identify the pathways in which genes may be involved, we also aligned the protein sequence of each gene to the KEGG database (release-93) 45 using BLASTP (v2.2.26) 38 with e-value cutoff 1e-5. Overall, 99.97% of the protein-coding genes of the four manakin genomes were annotated by the functional databases (Table 5).
Orthology assignment and phylogeny inference. To reconstruct the phylogenetic history of the seven genera in manakins, we chose one representative species for each genus, including the four species in this study and three published species (C. altera: GCF_003945725.1, M. vitellinus: GCF_001715985.3 and N. chrysocephalum: GCF_003984885.1). T. guttata (GCF_003957565.2) and Calypte. anna (GCF_003957555.1) were used as outgroups. The protein-coding gene sets of these species were obtained from NCBI. We used the T. guttata gene sets as the reference and performed a BLASTP (v2.2.26) 38 search on the protein sequences with an e-value cut-off of 1e-5. The reciprocal best hit (RBH) orthologs between T. guttata and every other species were identified following the published literature 46 but without the evidence of genomic synteny. In total, we obtained 9,654 one-to-one orthologs of these nine species by merging pairwise orthologs according to the reference T. guttata gene set.
The phylogeny of nine species was inferred based on the coalescent-based method, ASTRAL-III (v5.14.2) 47 . First, ortholog alignments were generated as follows: (1) we aligned the protein sequences with MAFFT L-INS-I (v7.487) 48 ; (2) we used trimAl (v1.4.rev15) 49 to achieve a column-based alignment filtering with the parameter "automated", i.e., a heuristic selection of the automatic method based on similarity statistics; and (3) the nucleic acid alignments were back-translated from the trimmed protein alignments. After these steps, we obtained 9,653 trimmed ortholog alignments containing 805,481 parsimony informative sites in total. Then, we inferred the gene tree for each ortholog alignment using IQ-TREE (v1.6.12) 50 with ModelFinder 51 function to determine the best-fit model. The output gene trees were next used as the input for ASTRAL-III (v5.14.2) 47 with default parameters to infer the species tree shown in Fig. 3. As ASTRAL-III measures the branch lengths in coalescent units, we further ran RAxML (v8.2.12) 52 under GTR + GAMMA substitution model to estimate the branch lengths in substitution per site for the concatenated ortholog alignments by specifying the ASTRAL species tree (Fig. 4a). We also used DiscoVista 53 to analyze the discordance frequencies between the ASTRAL species tree and the 9,653 gene trees (Fig. 4b). The frequency of three potential topologies is inferred based on the focal internal branches of the species tree with the main topology (in red) and alternative topologies (in blues). More phylogenetic discordance can be observed in branch 5. Specifically, the frequency of the gene trees that support C. holochlora or M. vitellinus as the sister clade to D. pipra and M. deliciosus is close (Fig. 4c). In contrast to our species tree based on the coding regions, the UCEs-based topology published by Leite et al. 2021 concluded M. vitellinus as the sister clade to D. pipra and M. deliciosus 54 . Previous studies have suggested that such topological differences could result from data-type effects 55,56 . As in Leite et al. 2021 study, the UCE-based and exon-based topologies were not consistent either. Considering that our result still differed from their reported tree even based on coding regions, we assumed that such conflicts of C. holochlora and M. vitellinus could be caused by their limited parsimony informative sites, our restricted number of species, or the evolutionary forces (e.g.   www.nature.com/scientificdata www.nature.com/scientificdata/ introgression and incomplete lineage sorting). More whole genome resources are needed to solve the phylogenomics of these genera. Selection analysis of plumage color related genes. Manakins are characterized by a variety of plumage colors [57][58][59] . To explore the possible genetic mechanism of the color diversity, we investigated the signatures of selection on 37 orthologous genes related to plumage color reported in previous studies [60][61][62][63][64][65][66][67][68][69] . With www.nature.com/scientificdata www.nature.com/scientificdata/ our phylogenetic tree, the maximum likelihood estimation of dN (non-synonymous substitution rate), dS (synonymous substitution rate), and ω (dN/dS) for each gene was performed under two branch models, one-ratio model (H0) and free-ratio model (H1), by using codeml program in PAML package (v4.9) 70 . Likelihood ratio test was used to test if H1 was significantly better than H0, and the output p-values were next corrected with the false-discovery rate (FDR) method. Under FDR-corrected p-value cutoff 0.05, if a branch showed ω > 1 in the branch model analysis, the gene was considered to be positively selected at this branch. We further filtered results with abnormally high ω values (ω > 3) 71 . We finally obtained four genes, TBC1D22A, EDA, SLC45A2 and GOLGB1, that were likely to have undergone positive selection during manakin evolution. Among them, SLC45A2 was found to be positively selected in M. deliciosus. The gene encodes a transporter protein that mediates melanin synthesis 66 . As pheomelanin is responsible for brown and reddish coloration 72,73 , the positive selection signal in M. deliciosus may explain its unique reddish-brown body plumage among other studied manakin species. The other three genes were found under positive selection in the internal branches. TBC1D22A was positively selected in the most recent common ancestor (MRCA) of M. deliciosus, D. pipra and C. holochlora (branch 5 in Fig. 4a). EDA was positively selected in the MRCA of M. deliciosus, D. pipra, C. holochlora and M. vitellinus (branch 4 in Fig. 4a). GOLGB1 was positively selected in both M. chrysopterus and MRCA of M. deliciosus, D. pipra and C. holochlora (branch 5 in Fig. 4a).
Sex chromosomes. Unlike mammals where males are heterogametic (XY system), in birds the females are heterogametic (ZW system). The avian ZW chromosomes are evolved from a pair of ancestral autosomes about 102 million years ago 74 . During evolution, the differentiation of sex chromosomes is caused by recombination arrests on the W chromosome, resulting in the reduction of functional genes on the chromosome and the accumulation of repetitive elements. The Z and W chromosomes of the extant Neoaves are of great differences in length and gene content 74 . Only a small PAR remains for recombination during cell division in females 74 .
We first confirmed the sex of the manakin samples by mapping the sequencing reads of the same individual to its genomes with BWA MEM (v0.7.17) 75 . Coverage information extracted by samtools (v1.9) 76 was calculated in 5 Kb non-overlapping windows with bedtools (v2.29.2) 77 and normalized by the peak coverage. We also softmasked the genomes and performed LASTZ(v1.04.00) 78 alignment with the manakin genomes using the T. guttata genome as a reference with parameter set '--step = 19 --hspthresh = 2200 --inner = 2000 --ydrop = 3400 --gappedthresh = 10000 --format = axt' . Based on the assumption that Z chromosomes are relatively conserved among avians, scaffolds mapped to the Z chromosome of T. guttata with the aligning rate >50% were treated as candidate Z-linked sequences.    www.nature.com/scientificdata www.nature.com/scientificdata/ sequences were then visualized to check the sex of the sequenced individuals. We confirmed most of the sex information was consistent with records except M. vitellinus (BioSample SAMN02299332). This sample is more likely to be a male instead of a female. Its normalized coverage distribution was similar between the Z and not_Z sequences, with both peaks at around one and without a rise at 0.5 (Fig. 5).
With the above procedures we identified about 75 Mb of Z-linked scaffolds containing 585 to 751 genes in the manakins species where a female was sequenced (Fig. 5, Table 6, Supplementary Tables 1 and 2). We further constructed these Z-linked sequences into pseudo-Z-chromosomes for visualization with Ragtag (v2.1.0) 79 using T. guttata Z chromosomes as reference under parameter set "-q 10 -d 100,000 -i 0.2 -a 0.0 -s 0.0 -g 100 -m 100000 -aligner minimap2 -mm2-params '-x asm5'". To obtain the genomic coordinate of the avian candidate sex determining gene DMRT1 80 , we used the DMRT1 protein sequence of G. gallus downloaded from UniPort as a query, and annotated the orthologous genes on the pseudo-Z-chromosome of manakins using Genewise.
We also used the normalized coverage to identify PAR in the genomes assembled from female individuals. Z-linked scaffolds with normalized depth greater than 0.7 were identified as PAR candidates. We found that PAR is conserved between manakins and T. guttata with length of about 600 Kb and containing about 16 genes. However, one exception was found in M. chrysopterus where the candidate PAR is 30 Mb and contains 228 genes (Fig. 5, Table 6 & Supplementary Table 1). Most of the 30 Mb region has become differentiated region (DR) in the most recent common ancestor of Neoaves for about 69 million years 74 , as well as the other manakins in this study. Thus, it is more likely that the region has reverted back to PAR or even autosome in M. chrysopterus. Such reversal is rare but has been found in other species 81,82 . Further exploration is required for the mechanism and explanation of this possible reversal.

Technical Validation
The assemblies of four manakins used in this study are the first version of the species. The average length of scaffold N50 and contig N50 were 29 Mb and 169 Kb, respectively. BUSCO analysis evaluated the genome assembly completeness. In total, about 95.23% core genes were assembled as complete genes of the four manakin genomes (single ~92.075%, duplicated ~3.200%, fragmented ~1.725%, missing ~ 3.000%). These results are comparable to those of three previously published manakins (Corapipo altera, Manacus vitellinus, and Neopelma chrysocephalum).

code availability
The version and parameters of bioinformatic tools used in this study have been described in the Method section. If no parameter is described, the default is used.  Table 6. Z chromosome statistics. *We suggest a differentiated region-to-autosome/pseudoautosomal region reversal has happened in this species.