Genome-wide association study for performance traits in chickens using genotype by sequencing approach

Performance traits are economically important and are targets for selection in breeding programs, especially in the poultry industry. To identify regions on the chicken genome associated with performance traits, different genomic approaches have been applied in the last years. The aim of this study was the application of CornellGBS approach (134,528 SNPs generated from a PstI restriction enzyme) on Genome-Wide Association Studies (GWAS) in an outbred F2 chicken population. We have validated 91.7% of these 134,528 SNPs after imputation of missed genotypes. Out of those, 20 SNPs were associated with feed conversion, one was associated with body weight at 35 days of age (P < 7.86E-07) and 93 were suggestively associated with a variety of performance traits (P < 1.57E-05). The majority of these SNPs (86.2%) overlapped with previously mapped QTL for the same performance traits and some of the SNPs also showed novel potential QTL regions. The results obtained in this study suggests future searches for candidate genes and QTL refinements as well as potential use of the SNPs described here in breeding programs.

SNPs validation. The dataset of 134,528 SNP chromosomal positions obtained with the CornellGBS before and after the imputation analysis was compared with the 600 K Affymetrix ® HD genotyping array dataset in order to perform a method validation, since both sets were obtained from the same animals (5 individuals from F 2 -7810 family). The genotype concordance of the SNPs with concordant chromosomal positions detected between the two methods is shown in Table 2. On average, 91.80%, and 91.66% of the SNPs had concordant genotypes between the CornellGBS and 600 K datasets before and after imputation, respectively. The accuracy of the heterozygous genotypes was slightly lower after the imputation. Reduced representation methods, like CornellGBS, has limitations calling the heterozygous markers 39 . In our study, we have observed that 82.14 and 82.30% of heterozygous SNPs, while 97.97 and 97.65% of all homozygous SNPs were validated before and after imputation, respectively.  Table 1. Means, standard deviations (SD), maximum (max) and minimum (min) values for performance traits of 444 individuals from the F 2 population. *g is the weight measured in grams.   sites), the average frequency of heterozygous SNPs was 25.32% (± 5.6%) before the imputation and after the imputation, it increased to 27.70% (± 5.2%). The average heterozygosity observed per chickens before imputation ranged from 8.30-44.69% and after the imputation between 11.38-44.67%. The proportion of heterozygous SNPs remained virtually unchanged before and after imputation among the lines/generations (Table 3).

Principal component analyses.
From the list of imputed genotypes we have conducted a principal component analysis (PCA), based on covariance, using Tassel v.5.2.26 39 to check the F 2 population structure. This plot was useful for visualizing internal structure explained by the variance from PstI-derived SNPs dataset of 134,528 SNPs using eigenvector-based multivariate analyses. Each individual lies in its proper group consistent with our F 2 population structure composed by five F 2 dame-based families (Fig. 1).
Descriptive Statistics of Heritability. The genetic and residual variance for each trait and their genomic heritability are shown in Table 4. Heritabilities ranging from moderate to high, as is expected 40,41 , were observed for feed intake and body weights traits, respectively. Low heritabilities were observed for the traits evaluated in short period (between 35 and 41 days) as feed conversion, and feed efficiency, because they are complex traits influenced by several environmental factors 42 .
Genome-wide association study. Twenty significant SNPs (P < 7.86E-07) were associated with feed conversion adjusted to body weight at 35 days (adj35) and one significant SNP associated with body weight at 35 days of age (Fig. 2). In addition to that, 92 suggestive (P < 1.57E-05) SNPs were associated with feed conversion adj35, feed intake adj35, feed efficiency adj35, birth weight, and body weight at 35 and 41 days of age (see Supplementary Spreadsheet S1 for the effects of associated SNPs; Manhattan and QQ plots are available on Supplementary Fig. S1).
Linkage disequilibrium analysis. Seventeen haplotype blocks were generated from the associated and suggestively associated SNPs from the F 2 population (see Fig. 3, Supplementary Fig. S2 for haplotypes details  Table 3. SNP heterozygosity of genotyped populations (parental, F 1 and F 2 generations) after and before imputation. and Supplementary Table S1 to SNPs' Mendelian descriptions). We noticed a standard block pattern between the SNPs that matched with the F 2 population structure (Fig. 3). Interestingly, we have checked the genotype frequency of blocks formed by LD analysis to determine if the blocks were fixed or not in the parental lines. From the  Table 4. Genetic and residual variances, and genomic heritability for each trait analyzed in this study. 35adj is adjusted to body weight at 35 days, SE is the standard error, and for Heritability, the SE is the standard deviations obtained from a repeated sampling approach (see M&M section for detailed information). haplotype blocks, we checked the origin of the variation (fixed or variable) and frequency from F 2 generation in the parental lines (Supplementary Table S2 and Supplementary Fig. S2 for a more detailed description of frequencies). We also determined the advantageous haplotype for each trait in the F 2 generation (Table 5). This information enabled us to identify from which parental line (TT or CC) comes the genotypic variation observed in F 2 for each block. All blocks with r 2 > 0. 56

Discussion
For better understanding of complex traits control in a segregating F 2 population, our research group have focused the attention on genetic association and linkage analyses using different approaches, as: candidate genes 14-17 and QTL mapping 3-7 , respectively, and more recently, NGS approaches 27,28 . We have presented here the first study using a higher density of SNPs in this F 2 population with GWAS purpose. Therefore, we have optimized a method called CornellGBS in chickens 35 to overcome the concept of pre-designed panels, since we planned a method for genotyping efficiently a specific dataset of SNPs in our specific population. CornellGBS is a widely employed method to genotype large genomes of model and non-model species exploring important regions in the genome 34 as microchromosomes 36,37 , as previously mentioned. This is due to the high coverage of tags (contigs) depending on the number of sequenced individuals of the reduced genome by restriction enzyme cleavage providing a specific SNP profile 35 . The CornellGBS technique was previously developed for inbreeding population and it is known by its general low sequencing coverage, which can cause significant loss of SNPs, mainly heterozygous 39 . For our outbreed population, we used a reasonable multiplex of individuals (~48 animals per lane of Illumina flowcell) to maintain a reasonable sequencing coverage per individual (~5X). We also reduced the loci call rate and use imputation to increase the number of SNPs genotyped. The reduction in the loci call rate was also applied in a recent study that used the same PstI restriction enzyme to cleave the cattle genome 44 . Furthermore, it was already mentioned that the combination of GBS and imputation of missing internal SNPs in haplotype blocks procedures can promote a cost reduction by allowing further reduction of the filtering criteria or sequencing coverage without causing losses in SNP calls 34 . Using this strategy, we doubled the number of SNPs, successfully imputing all lost genotypes (increasing the individual call rate to 100%), the validation ratio remained > 90%, and the percentage of heterozygous genotypes in our population had an increase of approximately 2% after the imputation.
The use of the GBS SNP panel for GWAS in our outbred F 2 crosses resulted in 20 SNPs associated (P < 7.86E-07) with feed conversion adj.35, one SNP associated with body weight at 35 days (BW35) and other 93 SNPs suggestively associated (P < 1.57E-05) with different performance traits (Table 1). Additionally, we noticed that all the evaluated traits, presented an up deviation of the theoretical quantiles ( Fig. 2 and Supplementary Fig. S1) of the probability distributions between expected and observed p-values, indicating the existence of QTLs. These results corroborated the Manhattan plot peaks of associated SNPs, indicating that these traits had part of the phenotypic variation significantly explained by the genetic component 45 . Interestingly, we detected association for several new QTLs located in microchromosomes (GGA11-28). This was only possible because of the distribution of the SNPs used. From our set of SNPs, 38.93% are located in large chromosomes (GGA1-5), 14.15% in intermediate size (GGA6-10), and most, 46.90% are located in microchromosomes (GGA11-28), confirming the microchromossome enrichment mentioned before 35 . Feed conversion, for exemple, had a high number of significant SNPs (P < 7.86E-07), mainly located in microchromosomes (GGA8, 10, 14, 18, 23, 26, and 27). However, for this trait, the SNP peaks observed by Manhatan plot, in large and intemediate size chromossomes (Fig. 2a), were not well defined, as is usually observed for QTLs peaks 46 . We believe that this is explained by the SNP profile used in this study, which has a lower density of SNPs on large chromosomes compared to microchromosomes 35 . Moreover, feed intake is a complex trait subject to a high residual effect and controlled by several genes with a small effect, which require a large sample size to detect associations 47,48 . This small effect also was previously attributed to the short period used to measure this trait (between 35 and 41 days of age) impairing the animal adaptation to the new enviromental condiction 4 . On this account, reliable QTLs for this trait were detected in studies that used a larger sampling size (1,534 individuals) for a longer feed intake evaluation period (~4 weeks) 49,50 than used in this study.
The GBS strategy can result in clusters of SNPs next to each other 35 . In order to better define the QTL regions we performed LD analysis. The block pattern between the SNPs matched with our F 2 population structure 51 (Fig. 3). This allowed us to define possibilities for genetic selection of the lines that did not present the genotype fixed giving attention to the different phenotypic abilities between the CC layer line and the TT broiler line 5,16,51 . As for example the blocks 2 and 13 (consider Fig. 3(d) for block numbers), which had variable genotypes for both the CC and the TT lines, and the paternal line presented the favorable genotype most frequent in both cases (see Supplementary Fig. S2 and Fig. 3).
Also is important to check if these SNPs are within QTL regions previously published. In the past years, many studies identified QTLs associated with performance traits in different chickens populations [3][4][5][6][8][9][10][11][12][13]17,52,53 (1,458 QTLs described in the Animal QTLdb) aiming to map loci that control these traits. Recently, to better understand these loci, studies have also applied GWAS with performance traits in chickens 13,26,54 . The validation of single SNP position obtained by GWAS overlapping with QTL regions can confirm interesting genomic regions to explore. From the 94 genome-wise associated and suggestively associated SNPs with the performance traits analyzed in this study, most of them were fairly distributed in mapped QTL regions in the chicken genome (Fig. 4)  13 SNPs did not overlap with QTL regions previously mapped. From these 13 SNPs, one was located on chromosome 1, one in chromosome 8 (GGA1 and 8), one in the Z sex chromosome (GGAZ) and 10 were located in microchromosomes (GGA17, 18, 20, 25, 27 and 28), which confirms the microchromosome enrichment profile obtained by this approach 35 and suggests novel QTLs to be explored in these regions. It is also important to mention that most of these 13 SNPs (those located on GGA1, 8, 18, 20, 25 and 27) were associated (P < 7.86E-07) or suggestively associated (P < 1.57E-05) with feed conversion adj35, one with feed efficiency adj35 (GGA17), and two with body weight at 41 days (GGA27) (see Supplementary Spreadsheet S1 for details). The genes where these SNPs are located are mainly related with cell cycle and metabolic pathways (according to the Reactome pathwayshttp://www.reactome.org/PathwayBrowser) and were within introns, upstream and downstream of these genes (see Supplementary Table S4 for functional annotation). Despite the importance of the overlap test performed here, previous studies in QTL mapping usually had large confidence intervals (> 1 Mbps) and often encompassing several genes, making difficult the selection of candidate genes 55 . Therefore, we also checked the overlap of these SNPs only with QTLs mapped using specifically the same traits and the same F 2 population 4,5,14,17 used in this study. From 23 different QTL intervals, we identified 12 SNPs overlapping with seven of them (see the QTLs bolded in Supplementary Tables S3 and S5 to check the QTL list). It is worth mentioning the SNPs located near the QTL regions, or flanking regions (see Supplementary Fig. S3). On GGA1, for exemple one SNP (marker 6, see Supplementary Spreadsheet S1) associated with feed intake (P = 3,83E-07) overlaped with one QTL mapped for the same trait 50 in another population, and also with 6 QTLs mapped for body weight at different ages 56 and one with feed efficiency 57 . On the other hand, on GGA4, a well studied chromosome in chickens 4,9,11,12,14,16,20,[23][24][25]54,[58][59][60][61][62][63] , we identified three SNPs composing the haplotype 3 (markers 21-23), in which one was associated with BW35 (P < 7.86E-07) and suggestively associated with BW41 (P < 1.57E-05), and two SNPs suggestively associated with BW41 (P < 1.57E-05). These three SNPs overlapped with one QTL region previously mapped in this same population for these same traits 4 (QTL_IDs from ChickenQTLdb = 7157; 7162 and 7185). The boundary SNPs from haplotypes 2 and 3 ( Fig. 3(d)) are separated by a short distance (less than 4 Mbps), but these QTLs are not linked, beside they have effect on the same traits (BW35 and BW41) (see Supplementary Spreadsheet S1) in our F 2 population. It is worth to mention, the haplotype 2 that overllaped with QTLs mapped for different BW in different ages 52,53,64-67 and growth 13,52,67 traits in different populations.
To the best of our knowledge, we showed the application of the CornellGBS PstI-derived SNPs to a GWAS for the first time in chickens. We showed a strategy, changing filtering criteria and subsequent genotype imputation, to increase the number of reliable SNPs to be analyzed. We found 13 SNPs indicating new regions associated with performance traits, mainly in microchromosomes, that have not been previously reported. We improved the available information about loci controlling performance traits and we refined these regions to discover novel candidate regions to be explored. Finally, by demonstrating that GBS is a valid strategy for QTL mapping in a species that has genome sequence and SNP panel available, we can argue the validity of GBS in species without genome resources. Methods All experimental protocols employed in the present study that relate to animal experimentation were performed in accordance with the resolution number 010/2012 approved by the Embrapa Swine and Poultry Ethics Committee on Animal Utilization to ensure compliance with international guidelines for animal welfare.
Animals and Phenotypes. This study was conducted using 464 chickens from a F 2 populations originated from a reciprocal cross-experimental population from Embrapa Swine and Poultry National Research Center, Concórdia, SC, Brazil. We also included in the analysis, 10 chickens from their parental lines TT and CC (5 from each one), and eigth from the F 1 generation. This F 2 population was developed for QTL mapping studies, and was originated from the crossing of seven males from a broiler line (TT) and seven females from a layer line (CC), resulting in seven full-sib families (F 1 generation). Then, twenty-one F 1 females were artificially inseminated with seven F 1 males (3:1 ratio) to generate the F 2 . The F 2 population comprised seven paternal half-sib families composed of three full-sib families of approximately 100 individuals each, produced in 17 hatches, totaling 2,063 F 2 chickens 51 . For this study, we selected the five most informative families based on the previously QTL studies 4,16,63 .
The TT broiler line was selected over six generation to improve body weight, feed conversion, carcass and breast yields, viability, fertility, eclodibility, reduction of abdominal fat and metabolic syndromes 51 . The CC layer line was selected over eight generation to improve egg production, egg weight, feed conversion, viability, sexual maturity, fertility, eclodibility, egg quality and reduction of body weight 51 .
The F 2 chickens were reared as broilers with free access to corn and soybean meal-based diet and water up to 42 days of age. From 35 to 41 days, they were transferred to cages to collect feed intake, and to compute the conversion and efficiency. Body weight was recorded at 1 (birth weight), 35, 41 and 42 (after fasting) days of age. The body weight (BW) at 41 days of age was collected at the end of the conversion measurement and, BW42 days was collected after 6-h fasting period and transportation to the slaughterhouse. More details were previously provided 5,16,51 . We analysed these six performance traits in this study (feed conversion, feed intake and feed efficiency between 35 to 41 days, birth weight, and body weight at 35 and 41 days of age). A total of 23 missed values from the selected traits were imputed by mean using Tassel v.5.2.26 tool 39 among the 446 F 2 animals selectect to be evaluated in this study.
DNA, genotypic data and imputation. Genomic DNA was cleaved with the PstI enzyme, ligated to adapters with barcodes identifying individual animals, and then sequenced on Illumina platform. After filtering parameters were applied, 134,528 SNPs were identified from 462 individuals in our experimental population of chickens using minimum minor allele frequency (mnMAF) of 1%; minimum taxon coverage (mntCov) of 20% and minimum site coverage (mnScov) of 70% filter parameters. All procedures to obtain the data were previously described 35 . After filtering parameters, the number of missing genotypes increased from 0.9 to 5.8 million. This number represents SNPs identified multiplied for the number of individuals genotyped (67,096 derived PstI-SNPs × 462 individuals using 90% of loci call rate and 134,528 derived PstI-SNPs x 462 individuals using 70% of loci call rate). The imputation of missing genotypes was performed using Beagle 4.1 software 38 using default parameters, which uses empirical LD model. This model adapts to the local structure in the data using iterative approach to haplotype phasing in which an initial prediction of haplotype phase is made, then the model is fit, and improved estimates of haplotype phase are obtained and the model is refit 38 . SNPs validation. The validation was performed by the comparison of the filtered 134,528 PstI-derived SNPs dataset with the 600 K Affymetrix ® HD genotyping array SNP dataset using five individuals from F 2 -7810 family.
The SNPs in both datasets were located on GGA1-28, 32, W and Z chromosomes on the Gallus-gallus-4.0 reference genome. The validation standards used in this study were based on a methodology previously proposed 68 considering chromosomal positions and genotype concordances. With these concordances, we estimate the accuracy for homozygous and heterozygous SNPs.
Principal component analysis. The principal component analyses (PCA) of imputed genotype data were performed using Tassel v.5.2.26 39 considering five principal components. This tool transforms a set of correlated variables into successive orthogonal PCs accounted for the maximum variance providing a way to highlight groups of individuals differing at the level of minor allele frequency 39 . The PCA graph was produced using the plot function in R 2.13.2 software (http://www.r-project.org/).

Genome-Wide Association Study.
The compressed mixed linear model (MLM) implemented in TASSEL v.5.2.26 software 45 was used for GWAS. The following form represents the statistical model: where y is the vector of the dependent variables, β is the vector containing fixed effects, including the sex (male/ female), hatch (1-17) and SNP (− 1, 0, 1) effects, and the covariate body weight at 35 days for traits measured from 35 to 41 days of age (feed intake, feed efficiency and feed conversion); u is the vector of random additive genetic effects from background QTL for individuals, X and Z are design matrices, and e is the vector of random residuals. We assumed that u and e vectors are normally distributed with null mean and variance of ; σ a 2 is an unknown additive genetic variance and K is the kinship (co-ancestry) matrix calculated from SNPs and provided by the same software using centered identity by state (IBS) method. For the residual Scientific RepoRts | 7:41748 | DOI: 10.1038/srep41748 effect, homogeneous variance was assumed, with σ = R I e 2 , where I is an identity matrix and σ e 2 is the unknown residual variance. The Restricted Maximum Likelihood (REML) estimates of σ a 2 and σ e 2 were obtained by the Efficient Mixed-Model Association (EMMA) algorithm 69 . Heritability (h 2 ) was calculated as the ratio of the additive genetic variance (σ a 2 ) to the phenotypic variance (σ a 2 + σ e 2 ). Tassel program does not provide the standard errors of the estimates. Thus, standard errors were estimated using the REML method with an average information (AI) algorithm by AIREMLF90 software 70 . Standard errors for additive genetic and residual variance were computed as square roots of diagonal elements of the inverse of the average information matrix. For Heritability, standard deviations obtained from the repeated sampling approach were considered as their standard error 71 . Each SNP allele was fit as a separate class with heterozygotes fit as additional SNP classes. The total SNP effect was not decomposed in additive and dominance effects but tested for overall significance 45 Quantile-quantile (Q-Q) plots for each trait and Manhattan plots of genome-wide association analyses were performed in R using ggd.qqplot and Manhattan functions. The thresholds were corrected for multiple testing (Bonferroni) established by the estimated number of independent SNPs and LD blocks (pairwise r 2 values > 0.40) 72 that was 63,640 SNPs. We set two thresholds from our data: P < 1.57E-05 (1/63,640) for suggestive significance and P < 7.86E-07 (0.05/63,640) for 5% genome-wise significance level 54,60 .
Linkage disequilibrium analysis. The linkage disequilibrium (LD) analysis was performed with Haploview 4.2 73 as well as the LD graphs. To perform the pairwise comparison of ours SNPs considering 1 Mb apart, we selected 94 genome-wise associated and suggestively associated SNPs with the traits analyzed in this study (feed conversion, feed intake, feed efficiency, and weight gain between 35-41 days of age, birth weight and body weight at 35 and 41 days). We have defined the haplotype blocks by the solid spine of LD and the family structure of parental lines (CC and TT pure lines), F 1 and F 2 generations (Fig. 3)  QTL overlapping with SNPs. The QTL data was obtained from QTLdb 43 and the overlapping test was performed using the GenomicRanges package from R 3.3.1 software (http://www.r-project.org/). For the data presentation, we designed a figure using ggplot2 package karyogram layout from R 3.3.1 software.