The first set of universal nuclear protein-coding loci markers for avian phylogenetic and population genetic studies

Multiple nuclear markers provide genetic polymorphism data for molecular systematics and population genetic studies. They are especially required for the coalescent-based analyses that can be used to accurately estimate species trees and infer population demographic histories. However, in avian evolutionary studies, these powerful coalescent-based methods are hindered by the lack of a sufficient number of markers. In this study, we designed PCR primers to amplify 136 nuclear protein-coding loci (NPCLs) by scanning the published Red Junglefowl (Gallus gallus) and Zebra Finch (Taeniopygia guttata) genomes. To test their utility, we amplified these loci in 41 bird species representing 23 Aves orders. The sixty-three best-performing NPCLs, based on high PCR success rates, were selected which had various mutation rates and were evenly distributed across 17 avian autosomal chromosomes and the Z chromosome. To test phylogenetic resolving power of these markers, we conducted a Neoavian phylogenies analysis using 63 concatenated NPCL markers derived from 48 whole genomes of birds. The resulting phylogenetic topology, to a large extent, is congruence with results resolved by previous whole genome data. To test the level of intraspecific polymorphism in these makers, we examined the genetic diversity in four populations of the Kentish Plover (Charadrius alexandrinus) at 17 of NPCL markers chosen at random. Our results showed that these NPCL markers exhibited a level of polymorphism comparable with mitochondrial loci. Therefore, this set of pan-avian nuclear protein-coding loci has great potential to facilitate studies in avian phylogenetics and population genetics.

Nuclear Protein-coding Loci (NPCLs) are exons without flanking introns 6 , and are widely used in interspecific phylogenetic studies (e.g. RAG1 7 , c-myc 8,9 ). NPCL markers possess favorable properties including homogeneous base composition, varied evolutionary rates and easy alignment across species or populations 10,11 . Moreover, orthologous genes can be identified accurately using their annotations 12,13 . Several sets of universal NPCL markers had been developed specially for beetles 14 , fish 15 , reptiles 6 , amphibian and vertebrates 16,17 . However, there is still no sufficient number of easily amplifiable NPCL markers that can fulfill the needs of modern coalescent-based analysis for most of bird species. As the most common and species-rich group of terrestrial vertebrates, birds exhibit tremendous diversity in their phenotypes, ecology, habitats and behaviors 18 . So far, a considerable effort has been devoted to resolve the phylogenetic relationships from higher taxonomic categories [19][20][21] to sister species [22][23][24][25][26] . In addition to phylogenetics, modeling-based approaches using multiple nuclear genes have also shed light on population structure and demographic history and allowed inferences of selection pressures in non-model organisms [27][28][29][30] . The rapid advance in these sub-disciplines in evolutionary biology always hinges upon proper sampling design and a rigorous statistical approach, but it also requires data on multiple independent loci with an appropriate level of genetic polymorphism 31 , which allows the application of sophisticated modeling and thus hypothesis testing.
Efforts of developing universal PCR primers have facilitated avian phylogenetic and population genetic studies [32][33][34] . For example, Dawson et al. 35 developed a set of microsatellite markers with high cross-species utility, suitable for paternity and population studies. Backström et al. 36 developed more than 200 exons flanking introns, which were evenly distributed throughout the avian genome. However, a variable number of indels (insertions and deletions) in the intron complicate the subsequent amplification, sequencing and alignment of these exons. Conserved and easily aligned exonic regions are ideal alternatives to compensate for resolving power for phylogenetic reconstruction 13 . Kimball et al. 37 tested the utility of 36 published markers on 42-199 bird species with only five exonic markers therein. Kerr et al. 38 developed 100 exonic markers from five avian genomes, and finally tested a subset of 25 markers in 12 avian orders. The quantity of NPCL markers is far from adequate as exon length should be longer than intron sequences to yield sufficient phylogenetic resolution 39 . Using a small number of universal NPCL markers could increase the probability of error when estimating species relationships due to the conflict of gene tree topologies. To overcome the problem, it has been advocated to use more genes with longer sequences 40 . However, some obstacles have hindered the development of universal NPCL markers. Firstly, widespread flanking introns make the identification of the exon boundaries of a specific NPCL marker difficult 6 . Secondly, multiple nuclear loci are required to be distributed evenly and widely across the whole genome in order to indicate a variety of historical signals. And finally, low-cost and easy amplification are important requisites. The development of a set of universal NPCL markers for birds should significantly reduce the time required for future research as well as its cost, and facilitate the application of coalescent-based methods in avian evolutionary studies.
In this study, we aimed to develop a set of avian universal NPCL markers that can be widely utilized in avian phylogenetic and population genetic studies. By comparing the published genomes of the Red Junglefowl (Gallus gallus) and the Zebra Finch (Taeniopygia guttata), we designed 136 pairs of NPCL primers and amplified them in 41 species representing 23 avian orders to check their versatility. To test the resolving power of these markers, we further constructed a phylogenetic tree and estimated mutation rates by extracting universal NPCLs from 48 published avian genomes 41 . Moreover, samples from four populations of the Kentish Plover (Charadrius alexandrinus) were also amplified to estimate the intra-specific polymorphic level of these universal NPCLs.

Results
Pan-avian order amplifications of the novel NPCLs. The genome alignment and BLAST procedures resulted in 136 NPCL candidates, which were broadly distributed across 24 autosomal chromosomes and the Z chromosome of the Zebra Finch genome. Their original fragment length ranged from 815 bp to 7176 bp (Supplementary Table S1). We thus nominated each NPCL marker using abbreviation of the associated protein-coding regions according to gene annotation of Zebra Finch (Supplementary Table S1). More than one primer pairs were conducted for each NPCL marker candidate, and we finally chose the pair of PCR markers with the highest score denoting the level of conservatism between Zebra Finch and Red Junglefowl genomes.
In total, 5,146 PCRs were performed to amplify the 136 NPCLs in 41 species representing 23 avian orders (Fig. 1A). Among them, 2,875 (55.9%) of PCR performances produced a target band (Supplementary Table S3). For the 136 candidates, we successfully amplified 12 NPCLs in all 23 orders, with 100% PCR success rate (PSR). Sixty three of the 136 candidate NPCL markers had a relatively good overall PCR performance (PSR ≥80%) ( Fig. 2A); all of them were successfully amplified in Caprimulgiformes and Gruiformes, and the PSR ranged from 65% to 97% in other orders (Fig. 1B, Supplementary Table S3). This set of 63 universal avian nuclear markers was distributed across 17 autosomal chromosomes and the Z chromosome (Fig. 3).
Interspecific mutation rate and phylogenic construction of the 63 universal NPCLs. The genomebased BLAST results showed that the widely used genetic markers, cytochrome b (cyt b) of mitochondrial DNA(mtDNA) and RAG1, an extensively used nuclear gene 42,43 were located in all 48 published genomes 41 . For the newly developed NPCLs, we located 56 loci across all 48 avian genomes. Among the remaining seven NPCLs, six of them were located in 47 genomes and two missing data recorded at the locus FUT10. Combined, BLAST results confirmed this set of 63 universal NPCL markers were orthologous among these 48 species (Supplementary Table S4) and the resulting concatenated matrix with sequences of approximately 96 kb was obtained (alignment available at: DRYAD https://doi.org/10.5061/dryad.ht3823d).
The range of the estimated mutation rates for the universal avian NPCLs is broad; it ranged from 0.0997 to 0.7317 × 10 −8 per site per million years (Fig. 2B). Among these 63 NPCLs, the mutation rates of 27 were slower than the mutation rate of RAG1, whilst the other 36 NPCLs were faster. All NPCLs showed a slower mutation rate than that of the mitochondrial cyt b.  PCR success rate distribution for the 136 NPCL marker candidates and mutation rates for the 63 universal NPCL markers. (A) PCR success rate distribution for 136 candidates in 23 avian orders. The 63 NPCL markers with PCR success rates higher than 80% are shown in black; other loci with PSR success rates below 80% (in grey) were excluded from the subsequent analysis. The number above each bar shows the number of NPCL marker candidates. (B) Mutation and success rates for the 63 universal NPCL markers. The markers were sorted according to estimated mutation rates (bars in black) from low to high. The number on the right of each bar is the mutation rate of each NPCL marker. PCR success rates are shown underneath (bars in grey). The mutation rates of widely-used NPCL RAG1 (in green) and mitochondrial gene cyt b (in blue) were selected as references.
Intraspecific polymorphism of 17 randomly selected NPCL markers. A total of 12,420 bp DNA sequences, including 11,196 bp of 17 NPCLs and 1,224 bp of two mitochondrial loci were sequenced in 40 samples representing four populations of the Kentish Plover. The NPCL markers showed varied degrees of polymorphism, with the exception of locus KBTBD8 (Fig. 5). There were 10 polymorphic sites in loci BIRC2 and FMN2, while there were only 1-6 polymorphic sites in other loci. Correspondingly, BIRC2 and FMN2 possessed the highest values of haplotype and nucleotide diversity (mean Hd = 0.84 and 0.92, mean π = 0.0047 and 0.0040, respectively). In contrast, a mitochondrial gene ND3 had only one polymorphic site, yielding a low haplotype diversity (mean Hd = 0.36) and nucleotide diversity (mean π = 0.0009). When we compared the results of interspecific mutation rates and intraspecific polymorphism, we found that the inter-and intraspecific genetic diversity of our gene markers were incongruent; although the estimated mutation rates at the study NPCLs were all much lower than that of mitochondrial gene cyt b, the intraspecific polymorphism at nine NPCL markers was higher than that of two mitochondrial genes.
The genetic polymorphism parameters varied greatly, not only among genes, but among populations as well. For example, the Hd value of the MAML3 gene was lowest in the Taiwan population (Hd = 0.51) and highest in the Qinghai population (Hd = 0.88). The measure for nucleotide diversity, π of the NCOA6 gene was lowest in the Taiwan population (π = 1.59) and highest in the Guangxi population (π = 3.81). Detailed information on measures from each population is available in Supplementary Table S5. The HKA test suggested no departure from the neutral expectation for any of the 17 NPCL markers. Similarly, the test of Tajima's D showed that none of the 17 NPCL markers deviated significantly from neutrality (Supplementary Table S5).

Discussion
We developed a set of 63 avian universal NPCL markers with diverse mutation rates and levels of intraspecific polymorphism. Our results showed that the 63 NPCL markers were successfully amplified in most of the species tested, representing 23 extant orders across major lineages of the avian tree of life (PCR success rate ≥80%), and denoted different levels of inter-and intraspecific polymorphism. Therefore, our NPCLs set will provide a highly versatile genetic toolkit for a broad range of molecular phylogenetic and ecological applications. Moreover, the genetic marker system we provide here is cheap and easy to apply. Any molecular laboratories that are capable of performing PCRs can adopt our marker system effortlessly. Hence, this novel set of universal NPCL markers has great potential to be widely applied in evolutionary biology studies in birds.
Inherited from different chromosomes, concatenation of nuclear markers contributes multiple independent estimates to species trees 47,48 , in order to alleviate the node conflicts of gene trees caused by incomplete lineage sorting, horizontal gene transfer, inconsistent evolutionary rates, gene duplication and/or gene loss and so on 49,50 . We constructed an avian phylogenetic tree using concatenation of 63 NPCLs across 18 chromosomes from 48 genomes. The result is largely similar with previous phylogenomic works using different data types like multiple nuclear loci 41,44 , introns 20 , ultraconserved elements (UCEs) 46,51 and retroposon presence/absence matrix 45 . The congruent parts of topology reveal multiple cluster clades, such as Telluraves (core landbirds), Aequornithia (core waterbirds) and Phoenicopterimorphae, Otidimorphae, Caprimulgimorphae and Phaethontimorphae. We also find some unresolved placements comparing with Jarvis et al. 41 . These include Columbiformes (doves), Pterocliformes (sandgrouses), Mesitornithiformes (mesites), Charadriiformes (plovers) and Gruiformes (cranes), which exhibits hard polytomies in the avian tree of life. Though recent efforts in avian phylogenomic studies using whole-genome 41 or genome-level data 44 , irresolvable relationships have been found in some clades 41,[44][45][46] . Suh et al. 45 investigated the causes of phylogenetic irresolvabilities and concluded that such phylogenetic discordances were originated from prevalent ancestral polymorphism denoted by incomplete lineage sorting (ILS) 52 , which is probably associated with an initial near-K-Pg super-radiation 41 in Neoaves. Unlike the two other main radiations that gave rise to the core waterbirds and core landbirds clades, the massive near-K-Pg super-radiation in Neoaves, containing several unresolved lineages, leads extreme ILS and associated network-like phylogenetic relationships 45,46 . On one hand, again, the topology reconstructed by the present set of universal NPCL markers captures these patterns, and suggests hard polytomies due to biological limitation of phylogenetic methods. On the other hand, it implies that our NPCL markers have sufficient polymorphism to resolve phylogenetic relationships among lineages with less ILS in Neoaves.
We also found that this set of novel NPCL markers has the potential to be applied in population genetic studies, in which researchers usually prefer to use abundant markers with high mutation rates. For example, microsatellites were developed for specific species or orders to detect differences in genotypes and further to quantify intraspecific genetic diversity 35,53,54 . But introns like microsatellites have high levels of length homoplasy 55 . It is commonly assumed that NPCLs are conservative loci, highly suitable to address questions concerning high-level systematics 40 . However, some population genetic studies highlight the importance of using functional exonic SNPs in population genetic studies 11,56 , comparing to neutral markers (such as microsatellite and mitochondrial DNA). Datasets that contain numbers of several to more than 100 exon genes 51,57 can support accurate and SciENTific RepoRts | (2018) 8:15723 | DOI:10.1038/s41598-018-33646-x reliable estimates of population genetic parameters 55,58 , and have a substantial power in population genetic analysis 5,40,50,59 . Studying 17 NPCL markers in the Kentish Plover, we found that sixteen had low to moderate levels of intraspecific polymorphism and nine of them showed higher genetic diversity than mitochondrial genes in this study. Although a previous study showed a low level of genetic differentiation across Eurasian populations 60 , this species exhibits variability in morphology and behavior among and within populations in East Asia 61 , warranting further coalescent-based analysis of their evolutionary history. This dataset provides sufficient information to study population genetics in the Kentish Plover in East Asia.
In order to infer correct phylogenetic relationships in different taxonomic levels, it is essential to choose unlinked genes with different mutation rates 62 . Our novel set of NPCL markers offer a wide range of mutation rates. The comparisons of mutation rates between the new NPCL markers with commonly used nuclear loci RAG1 7 and some loci at mtDNA 63 provide a reference for marker choice (Figs 2B and 5). In principal, it is advisable to use markers with slow mutation rates to resolve deep nodes and fast mutation rates to population genetic studies. Moreover, coalescent theory is widely used to estimate species tree and population demographic parameters, such as divergence times 64,65 and effective population sizes (N e ) 59 . The associated analyses, such as species-tree estimation, e.g. MP-EST 66 , *BEAST 67 , BP&P 68 , and demographic analysis, such as Isolation with Migration (IM) model 69,70 and Approximate Bayesian Computation (ABC) simulations 59 require multiple independent loci with different demographic histories and mutation rates. In this regard, markers from different genomic segments, such as introns (developed previously [35][36][37] ) and exons we developed are preferred to combine to be used. It is no doubt that the present marker set is a useful resource to generate multilocus datasets for avian evolutionary studies in different taxonomic levels. In fact some studies have used these novel NPCL markers to apply the aforementioned analyses 25,30 .
Compared with traditional Sanger sequencing, the fast development of next generation sequencing (NGS) techniques has enabled researchers to obtain genetic polymorphisms easily 71 . For example, Jarvis et al. 41 performed a highly resolved phylogenetic tree of 48 species using phylogenomic methods, and Prum et al. 44 conducted a comprehensive phylogeny of 198 species within the Neoaves, which diversified very quickly, using genome-scale data by targeted NGS. Multilocus methods do not use as much as genomic data. However, we consider that this set of universal NPCL markers has its niche in avian molecular studies. Sanger sequencing technique of NPCL markers is less sensitive to the quality of template DNA like sequence capture approach than other genomic approaches. Degraded DNA or a small quantity of DNA is also workable, like feather and museum samples. It is always a tradeoff between template DNA quality and PCR product length. With the novel NPCL markers, we aimed to amplify a fragment of 700-1200 bp sequence of each locus. Hence they should be applicable to avian blood, tissue and feathers. Moreover, a thorough analysis pipeline for traditional PCR-based method is available, supported by a series of visualized operating software, e.g. MEGA, DNASTAR, DnaSP, BEAST and etc., which are widely used in molecular phylogenetic analysis. Processing genomic data always places high demands on bioinformatics and computational power 72 . High-quality samples for NGS, project budget, bioinformatic facilities are not available to all laboratory. It is still useful and necessary to align orthologous sequences across multiple hierarchical levels using NPCL markers, especially for a pilot or small scale study.
However, there are some limitations when using this set of universal NPCL markers. Firstly, PCR performances were simultaneously tested under a unified protocol (e.g. Tm = 50 °C), so that the PSR of each NPCL marker might be underestimated. Reducing the annealing temperature by 1~2 °C would improve the success rate in practice. There is also the possibility that PCR produced target sequences but also non-specific amplicons. We could slightly raise the annealing temperature to increase specificity or perform extra steps including gel purification and cloning. Furthermore, the interspecific polymorphic parameters of the 17 NPCL markers are reference values for Kentish plovers. Different evolutionary forces, such as genetic drift or natural selection, can act on different regions of the genome, causing a various evolutionary rates and demographic histories in different species 73 . Thus, different combinations of markers are important for specific questions. For example, NPCL markers on the Z chromosome could be selected to solve questions involving sexual selection and mate choice. There is also a trade-off between the number of markers and time-and cost-efficiency. In avian phylogenetic analysis, random errors can be reduced by employing more markers, whilst, as a consequence of this procedure, systemic errors would increase due to differences in nucleotide composition and various mutation rates 74 . Kimball et al. proposed that adopting various analytical methods might overcome these adverse effects 75 .
In conclusion, we have developed 63 avian universal NPCL markers, evenly distributed across 17 autosome chromosomes and the Z chromosome. This set of universal NPCL markers had high PCR success rates (PSR ≥ 80%) in 23 avian orders. Its wide range of mutation rates are suitable to resolve phylogenetic relationships at both low and high-level. Furthermore, various intraspecific polymorphisms are potentially useful to provide deep-level divergence and demographic information for population genetics. Though high-throughput genetic polymorphism data from next generation sequencing undoubtedly provide a more comprehensive vision for avian evolutionary history and genomic patterns, we believe that this set of exonic markers provides a relatively reliable and repeatable solution and could have widespread application in phylogenetic and population genetics studies.

Development of NPCL markers and primer design.
To screen NPCL marker candidates, we aligned parts of the genome of two species with a distant phylogenetic relationship 76 , the Red Junglefowl (GCA_000002315) and the Zebra Finch (GCA_000151805). Firstly we identified long (>600 bp) single-copy exons within the genome of the Zebra Finch and took these exons as templates. Then we aligned them with the genome of the Red Junglefowl using BLAST (Basic Local Alignment Search Tool). We assumed that query sequences of the Red Junglefowl with identity more than 80% and length more than 50% of templates length were orthologous exons and employed them as NPCL marker candidates.
We used the program Primer3 77 to design the primers for NPCL marker candidates. We selected exon sequences of Zebra Finch as templates, focusing on High-scoring Segment Pairs region (HSP, 700-1200 bp). For each primer pair, the oligomer ranged from 18 bp to 25 bp and GC content ranged from 20% to 80%. Furthermore, we tested a single primer for self-complementarity by setting complementarity score to less than 6.00, so as to predict the tendency of primers to anneal to each other without necessarily causing self-priming in the PCR. Complementarity 3′ score was set as default (<3.00) to test the complementarity between left and right primers.
Tests on the universality of the NPCL markers. To test the amplification performance of these new NPCL markers, we selected 41 species of 23 representative Aves orders (Supplementary Table S2). We used a set of 10000 trees with 9993 operational taxonomic unites (OTUs) downloaded from http://birdtree.org/ to demonstrate the phylogenetic relationships among selected species 18 .
Total genomic DNA was extracted from ethanol-preserved muscle tissue or blood using a TIANamp Genomic DNA kit (TIANGEN, China) and stored at 4 °C. DNA concentration and purity were estimated by NanoDrop 2000 (Thermo Scientific, USA). We used Touchdown PCR (TD-PCR) 78 , an improved standard PCR to test the utility of primers sensitively, by decreasing the annealing temperature 1 °C/cycle from Tm +10 °C to Tm (Melting SciENTific RepoRts | (2018) 8:15723 | DOI:10.1038/s41598-018-33646-x Temperature). The Touchdown PCR was performed in a Veriti96 PCR thermal cycler system (ABI, USA) using a 10 μl reaction containing 2 μl template DNA (10-70 ng totally), with mixed concentrations of 10 × PCR buffer, 20 μM dNTP, 10 mM of each forward and reverse primer, and 5U Taq polymerase (Takara, China). The initial temperature profile was 2 min at 94 °C, 10 cycles at 94 °C for 30 s, 60-50 °C (decreasing the annealing temperature by 1 °C per cycle) for 30 s and 72 °C for 90 s followed by 30 similar cycles but with a constant annealing temperature of 50 °C. This process was concluded with an extra elongation step at 72 °C for 10 min. A successful amplification was recorded if a single clear band (target locus) was observable under ultraviolet light after being isolated on a 1% TAE agarose gel at 120 V for 30 min.
Estimation of inter-species mutation rates and construction of Neoavian phylogeny. We downloaded 48 avian genomes covering all orders of Neoaves 41 . Sixty-five gene sequences, including the 63 universal NPCLs and two frequently-used genes (RAG1, and mitochondrial cytochrome b (cyt b)) as control, were retrieved and aligned in these genomes against genes in Zebra Finch (abbreviation as Tgu1) using BLAST. These sequences were extracted and filtered in batches by own-developed Perl script (shared in DRYAD https://doi. org/10.5061/dryad.ht3823d) in the Tianhe-2 server (School of Advanced Computing, Sun Yat-sen University). Because of different genome sequence format, the script was unable to retrieve sequences in four species, i.e. Anas platyrhynchos, Gallus gallus, Meleagris gallopavo and Melopsittacus undulates 41 . Hence, we manually searched and obtained orthologues of the four species using BLAST tool on the website https://blast.ncbi.nlm.nih.gov/Blast.cgi. We further aligned the obtained NPCL orthologous sequences using MEGA v6.0 79 .
To estimate the mutation rate of each gene, we firstly computed the overall mean genetic distance at each NPCL marker in MEGA v6.0 with 1000 bootstrap replicates. Then we calculated the ratio of genetic distance between each NPCL and cyt b. Finally, we multiplied the ratio by the average mutation rate in the cyt b (0.01035 mutations per site per million years) 80 to get the average mutation rate of each NPCL 81 .
To construct the phylogenetic relationship of Neoavian birds in order level, we concatenated all NPCL sequences and reconstructed the maximum likelihood tree by RAxML v8.2.1 82 , with GTRCAT model and 1,000 bootstrap runs. Maximum-likelihood-bootstrap proportions ≥70% were considered strong support 83 .
Intra-species polymorphism measurements. We amplified 17 random NPCLs in 40 Kentish Plover (Charadrius alexandrinus) blood samples from live-trapped birds in a noninvasive manner. To compare our data with previous genetic analyses on European populations of the Kentish Plover 84 and compare the degree of polymorphism between nuclear and mtDNA, we added two mtDNA loci, ATPase subunit six concatenated with partial ATPase subunit 8 (ATPase6/8) and NADH dehydrogenase subunit 3 (ND3). Blood samples were collected from four breeding populations of plovers, Guangxi (GX), Qinghai (QH), Hebei (HB) and Taiwan (TW) ( Table S2). The same protocol for DNA extraction and PCR amplification was followed as above, and the products were sequenced on ABI3730XL (Applied Biosystems, USA) by Beijing Genomics Institute (BGI, China).
Both strands of the amplicons were assembled, and the heterozygosity of nuclear genes was detected using SeqMan v7.1.0.44 85 . Some parameters of the DNA polymorphism, the number of polymorphic sites (S) and haplotypes (H), haplotype diversity (Hd), and nucleotide diversity (π) were calculated using DnaSP v5.0 86 . The neutrality of each locus was tested using the Hudson-Kreitman-Aguade (HKA) test 87 and Tajima's D 88 implemented in DnaSP v5.0.