Imputation-Based Whole-Genome Sequence Association Study Rediscovered the Missing QTL for Lumbar Number in Sutai Pigs

Resequencing a number of individuals of various breeds as reference population and imputing the whole-genome sequences of individuals that were genotyped with medium-density chips to perform an association study is a very efficient strategy. Previously, we performed a genome-wide association study (GWAS) of lumbar number using 60K SNPs from the porcine Illumina chips in 418 Sutai pigs and did not detect any significant signals. Therefore, we imputed the whole-genome sequences of 418 Sutai individuals from 403 deeply resequenced reference individuals and performed association tests. We identified a quantitative trait locus (QTL) for lumbar number in SSC1 with a P value of 9.01E-18 that was close to the potential causative gene of NR6A1. The result of conditioning on the top SNP association test indicated that only one QTL was responsible for this trait in SSC1. The linkage disequilibrium (LD) drop test result for the condition of the reported potential causative mutation (c.575T > C missense mutation of NR6A1) indicated that this mutation was probably not the underlying mutation that affected lumbar number in our study. As the first trial of imputed whole-genome sequence GWAS in swine, this approach can be also powerful to investigate complex traits in pig like in human and cattle.

SNPs is needed to identify missing QTLs. With the rapidly decreasing costs of next-generation sequence technology and the increasing accuracy of sequencing, numerous researchers have employed sequencing or resequencing to understand the demography, diversity and selection sweep of the investigated animals [13][14][15] . However, resequencing thousands of individuals and then determining associations for economically important traits is still an inefficient strategy. A more efficient approach is to impute the whole-genome sequence genotypes of individuals genotyped with medium-density chips using a previously sequenced reference population, and then determine associations between imputed genotypes and traits of interest using well-developed GWAS software. This approach is very popular for human disease studies, such as HapMap 16 and the 1000 Genomes Project 17 , which provided standard reference panels. This approach has also worked very well in cattle, such as the 1000 bull genomes project (Run 2.0) 18,19 . To the best of our knowledge, there are still no GWASs using whole-genome resequenced data in pigs.
Previously, we performed a GWAS using 60K porcine Illumina chips in Sutai pigs to detect the association loci for lumbar number. We expected to identify significant loci for this trait in Sutai pigs because this breed originated from Duroc and Erhualian pigs, which have similar paternal and maternal structures of an advanced intercross resource family 20 . Unexpectedly, no association signals were identified in Sutai pigs for lumbar number, which was different from the results of most published QTL mapping studies. Therefore, we hypothesized that the non-significant result may have arisen because of the low LD between causal mutation and nearby SNPs. To increase the detection power and decrease the cost of the GWAS, we first imputed the genotypes of 60K chips to the genotypes of whole-genome sequence variants in Sutai pigs using a reference panel containing 403 deep-sequenced individuals. Then, we used the imputed genotypes to reperform GWAS for the same phenotypes with the objective of determining whether there was a genetic variation in NR6A1 associated with lumbar number in this breed. As noted above, the c.575T > C missense mutation of NR6A1 was the strongest potential candidate for lumbar number. However, the causality of this SNP in Sutai pigs was unknown. In this study, we genotyped this mutation to estimate the imputation accuracy and its causality in Sutai pigs.

Methods
Ethics statement. All the experiments that involved animals were performed in accordance with the guidelines approved by the Ministry of Agriculture of China. Approval was obtained from the ethics committee of Jiangxi Agricultural University before this study.
Animals of the target population. The target population of Sutai pig is a synthesized swine breed produced by crossing the Western Duroc and Chinese Erhualian breeds with continued selection for 19 generations. For the present study, we genotyped and phenotyped 526 individuals. The pigs were raised with the same fodder under uniform circumstances and slaughtered at 240 days of age in a commercial slaughterhouse. After the harvest, the carcasses were cut into halves and the numbers of lumbar vertebrae were counted and recorded. The lumbar number was either 5 or 6 in 436 pigs, including 206 gilts and 230 barrows, and the lumbar number was not available for 90 animals. More detailed information on the pigs' environment and other phenotype data for these experimental animals were provided in our previous study 21 .
Genomic DNA samples were extracted from ear tissue using the standard phenol/chloroform method 22 , and the samples were diluted to a standardized concentration of 50 ng/µl after the quality was checked. A total of 526 samples were genotyped using Illumina PorcineSNP60 Beadchips, including 62,163 SNPs, on an iScan System (Illumina, San Diego, CA, USA) 23 . Quality control (QC) was conducted using PLINK (v1.90 beta) to detect and exclude unreliable genotypes 24 . SNPs with a missing rate of each marker (geno) >0.1 or with minor allele frequency (MAF) <0.05 were excluded. Individuals with a call rate <0.9 were also removed. To maintain consistency with the sequencing data, the primer sequences of each SNP were aligned to the reference porcine genome assembly Sus-scrofa 10.2 using BLAST to detect their positions and forward (reverse) strand information. SNPs without positions were excluded, and the genotypes of reversed SNP strands were flipped using PLINK software.
Haplotype construction of the reference panel. In this study, a wide collection of 403 whole-genome sequence data from 10 different pig populations 15,[25][26][27] was used as a reference and each breed contained 9 to 86 pigs. More details on the breeds, origins and sample size are listed in Table 1. The sequencing coverage of these individuals ranged from 5 to 25. The raw reads were cleaned based on a quality score threshold >15, which passed chastity filtering and would be then aligned to the reference porcine genome assembly Sus-scrofa 10.2 using BWA (Burrows-Wheeler Aligner) 28 . Variants were identified following the GATK (Genome Analysis Toolkit) 29 best practice protocol. PCR duplications were first marked by Picard MarkDuplicates (http://broadinstitute. github.io/picard/), and local realignments were performed with GATK IndelRealigner. Individual GVCF files were produced using GATK Haplotypecaller. Variants were called and filtered with GATK Genotype GVCFs and VariantFiltration options. Structural variants were removed with VCFTOOLS 30 . With cleaned SNP data, the haplotypes of 403 individuals were constructed using Beagle (v4.1) 31 . Imputation. Imputation from 60K SNPs to whole-genome sequences for Sutai pigs was conducted with Beagle (v4.1) 32 using the default parameter settings, and the size of each sliding window was set to 7,000,000 bp. This software is based on a hidden Markov Chain Monte Carlo algorithm for imputation that first constructed local haplotypes using the MCMC algorithm and then resampled new estimated haplotypes for each individual using the HMM model.
Because of the very low density and common variants (MAF > 0.05) in 60K (Illumina, San Diego, CA, USA), imputation accuracy should be investigated in whole-genome sequence data. We used a 15-fold cross-validation strategy described in several previous studies [33][34][35] . Ninety individuals were selected randomly from the sequenced reference population as a target population for each fold (i.e. there would be some same individuals sampled in different target populations), and the genotypes in this target population were reduced to the variants that were included in the 60K genotyping array. The remaining individuals (313) were included in the reference panel. Two validation actions were taken to calculate the accuracy of imputation. One action was allelic correct rate (CR), which calculated as the number of alleles imputed correctly divided by total alleles at each locus, and the more detailed formula (see equation (1)) was as follows: where m and N are the number of individuals and SNPs, respectively, and Obs (n ij ) and Imp (n ij ) are the observed and imputed numbers of allele "1" for individuals i at marker j, respectively. The other action was the correlation coefficient between true and imputed SNPs. To investigate the imputation accuracy impacted by MAF, we classified CR and correlation into 10 classes with regard to the MAF of imputed SNPs. The accuracy of imputation was the mean CR or correlation across 15 folds for each class.

GWAS analysis.
The associations between lumbar number and imputed genotypes were tested using GEMMA (v.0.93) 36 . This method implements a mixed model 37 (see equation (2)) including covariates when we carried out conditional association test and LD drop association test, SNP effects, individual effects and residual error, which were calculated with the following formula: where y is the vector of phenotypes; W is a matrix of covariates, including a column of 1s; α is a vector of the corresponding coefficients, including the intercept; x is a vector of genotypes; β is the effect of markers; u is a vector random effect following the multivariate normal distribution (see equation (2)), in which τ −1 is the variance of the residual errors, λ is the ratio between τ −1 and ε, and K is a kinship matrix that is estimated from whole-genome sequence variants; ε is a vector of errors following the multivariate normal distribution (see equation (2)) and I n is an identity matrix. Using naïve Bonferroni corrections of 0.05 divided by the number of examined SNPs would lead to an overly conservative threshold because these SNPs were highly correlated with each other. Pe' er et al. and Johnson et al. suggested that 5E-08 could serve as a genome-wide significant threshold in human GWASs based on haplotype blocks of an African population structure 38,39 . Based on the assumption that an equal number of independent haplotype segments between pigs and humans are held, we used the same genome-wide threshold in our study. The model for the GWAS of Sutai pigs with 60K genotypes was the same as that used for whole-sequence association tests and the kinship matrix was estimated either from 60K SNPs (original SNP-data) or whole-genome sequence variants. To make the results comparable, the values of the 60K marker from the results of the whole-sequence association study were extracted for comparison.

LD analysis.
To detect the linkage disequilibrium (LD) of SNPs near the most significant SNPs in the GWAS results, the 3 Mb region near the top SNPs in the whole-sequence association results was used to conduct LD analysis by extracting genotypes from the 60K data set using Haploview (v.4.2) software 40 . Haplotype blocks were then estimated with a confidence intervals algorithm in Haploview.
Genotyping of c.575T > C locus. Variation of the c.575T > C (rs326780270) of NR6A1 in Sutai pigs was detected following the methods of Yang et al. 41 . Briefly, a 360 bp segment was amplified and cut into two pieces of

LD drop association test.
To determine whether NR6A1 c.575T > C was the mutation that determined lumbar number in Sutai pigs, we performed an LD drop test by including the genotypes of NR6A1 c.575T > C in the mixed model framework to determine how rapidly the association with the signal decreased.

SNP characteristics after QC in the target panel.
After QC, 11,338 variants were excluded for the lack of chromosome position information, 42 pigs were removed due to a low genotype call rate, 3,229 variants were removed due to a low call rate and 9,804 variants were excluded for low minor allele threshold(s). Finally, a total of 37,792 SNPs and 484 pigs were introduced to perform further analyses.
Summary of imputation. Imputation was produced using Beagle software. The summarization of imputation results is presented in Table 2. After imputation, we obtained 87,552,595 SNPs for 484 individuals, and 20,985,704 SNPs were kept after filtering with MAF > 0.01. SSC1 was selected for 15-fold cross-validation to calculate the imputation accuracy tested by CR and correlation related to MAF. The correct rate decreased when MAF increased. In contrast, the correlation increased along with the increase of MAF (Fig. 1)

Summary of GWAS.
We conducted a GWAS on the Sutai population in two scenarios, i.e., the target panel data before and after imputation. In the scenario for before imputation, as noted above, no significant loci were detected in Sutai pigs using 60K chips (Fig. 2a,     LD results. By carrying out GWAS with imputation data, we identified the most significant SNP at a position of 299,627,873 bp as well as a total of 31 markers that were extracted from the significant region (3 Mb) in the 60K data that were used to conduct LD analysis. The LD block was shown as follows (Fig. 3). Three blocks were detected on this region using a confidence interval algorithm. The most significant was the smallest block of approximately 212 kb, and the r 2 among each SNP in this region was very low. The NR6A1 gene was not present in any block in this region.
Results of genotyping c.575T > C. Among the 526 samples, a total of 382 pigs were genotyped on the c.575T > C locus. Subsequently, we obtained 187 CC genotypes, 166 CT genotypes and 29 TT genotypes (see Supplementary Table S1). To further confirm imputation accuracy, we compared imputed genotypes to real genotyped genotypes on this locus and found that only 12 of 382 individuals had different genotypes. In other words, a very high allelic imputation accuracy (98.43%) was obtained at this locus.

Results of the conditional association test and LD drop association test. After GWAS was per-
formed by including the most significant SNP from imputed GWAS results in a mixed model as a covariate, no additional genome-wide significant loci were detected on this chromosome, which indicated that only one major QTL affected lumbar number (Fig. 4a).
After fitting genotypes of NR6A1 c.575T > C into the mixed model for the LD dropping test, we still identified a genome-wide significant locus near the top SNP at a position of 299,432,549 bp with a P value of 1.93E-08. This result probably indicated that locus NR6A1 c.575T > C was not the causative mutation in Sutai pigs for lumbar number (Fig. 4b).

Discussion
Imputation-based association studies have achieved great success in humans [42][43][44][45] and some livestock, such as cattle 46 . Both have resequenced more than 1000 individuals of multiple populations as reference panels, and the unrelated targets were genotyped using middle-(high-) density SNP chips. Whole-genome sequences of the target panel were imputed based on shared haplotype blocks between reference and target individuals and then were used to test associations of complex disease (traits) or to predict the genetic potential of economically important traits. Imputation accuracy ranged from 0.90 to 0.95 in cattle from the genotypes of an Illumina BovineHD genotyping array to whole-genome sequence data 35 . High correlations (0.64) were observed in humans with MAF = 0.1% when imputing an Illumina 1M SNP array to whole-genome sequences using a reference panel of 64,976 haplotypes 45 . In our study, a total of 403 individuals were included on the reference panel, including 32 Duroc and 29 Erhualian pigs, which are ancestors of the Sutai population. CR decreased and correlation increased  Table 3. Description of the most significant 20 SNPs associated with lumbar number by GWAS. Chr: chromosome number; rs: SNP IDs and two SNPs that do not possess rs ID were named after rsxxxxxxxx1 and rsxxxxxxxx2, respectively, by the author; ps: base pair positions on the chromosome; n_miss: number of missing values of the SNP; beta: beta estimates; se: standard errors for beta; l_remle: remle estimates for lambda; l_mle: mle estimates for lambda; p_wald: P value from the Wald test.
along with an increase in MAF. CRs are highly sensitive to allelic frequencies and are not appropriate for comparing SNPs with different values of MAF 47 . Correlation is a more popular approach that is used to evaluate imputation accuracy. The correlation values ranged from 0.74 to 0.86 with an average of 0.80, which was lower than the results for cattle and human studies. Both were imputed from high-density chips (600K in cattle and 1M in human) to sequenced data, and the reference panels were very large. In pigs, the vast majority of studies were based on genotypes from the 60K porcine Illumina BeadChip because a high-density panel (600,000 SNPs) that provides high-quality imputed genotypes in pig populations is currently impractical. Therefore, increasing the number of sequenced populations and individuals in the reference panel to improve imputation accuracy is necessary. Our GWAS results demonstrated that this was a powerful method to identify QTLs in agricultural animals, and this method will help researchers find new loci or rediscover QTLs associated with complex traits.
Since the first application of GWAS research on age-related macular degeneration was performed successfully in 2005 by Klein et al. 48 , GWAS has become an effective method for identifying genetic variations associated with economically important traits in agricultural animals. A recent GWAS study showed that the QTL for the number of vertebrae on chromosomes 1 and 7 independently influenced the numbers of thoracic and lumbar vertebrae 49 . Potentially significant signals could be missed in a GWAS analysis if low-density SNPs were applied to a population that held a low LD characteristic, such as the results of our GWAS when only 60K SNPs were used before imputation and a highly significant QTL was uncovered for lumbar number after imputation. The LD between top SNPs in the 60K original association results (rs81352477) and top SNPs (rs334252332) in the sequence association result was 0.75, indicating a medium correlation. The increased detection power was probably due to causal mutations being in the data by imputation. This result was further confirmed by displaying the LD profiles of markers near the top loci. The top SNP was located in the smallest haplotype block, and the r 2 values among these SNPs in this region were very low, which hampered the discovery of association signals. Furthermore, no haplotype block was found near the NR6A1 gene, which implicated the low LD station in that region in the Sutai population. The Sutai breed was intercrossed from Erhualian female and Duroc male for approximately 19 generations. Thus, the Sutai genome is a mosaic mixture of these two breeds. As a result, the LD block is smaller than the LD block in either of the two founder breeds. In this study, we identified 105 significant SNPs located on chromosome 1 across a region of 4.6 Mb (298,912,325 bp-303,530,285 bp) associated with lumbar number, and the highest signal was located on 299,627,873 bp of chromosome 1. This region contains the NR6A1 gene, which was reported to be associated with lumbar number 6 . The results also showed that using whole-genome resequencing data to perform genotype imputation can be an effective method to identify the QTLs that were missed in low-density SNP GWAS analysis. The imputation method can also narrow the QTL region or improve the power when GWAS analysis is performed. To determine whether population stratification was corrected in this study, we exploited quantile-quantile plots (Fig. 2c and d) from the GWAS with 60K SNP data and imputed the sequenced data. The two quantile-quantile plots with lambda values of 1.08 and 1.06 showed that the population stratification effect was adjusted very well, and the detected signal was most likely reliable.
Although we identified the same QTL as that identified in a previous study 4 , the reported potential causative mutation at position of 299,084,752 bp (c.575T > C) 6 showed only a weak association with lumbar number in our study (P value = 2.26E-06). The possible reason for this result is that the QTN in the position of 299,084,752 bp may not be the causative mutation in the Sutai population. To confirm that NR6A1 c.575T > C was the causative mutation in our population, we performed an LD drop test by fitting genotypes of this locus into a mixed model. Normally, all significant signals nearby would disappear after correcting for causative mutation. The minimum P value increased from 9.01E-18 to 1.93E-08, which still indicated genome-wide significance. This result indicated that NR6A1 c.575T > C was not the causative mutation in our study. The results also indicated that we should recognize that the accuracy of imputation also affects the GWAS result. Several imputation studies in different species have shown that as the minor allele frequency in the target panel decreased, the imputation error rate increased 34,50 . As shown in previous studies, the fundamental aspect of imputation is the identical DNA segments in the target and reference panels, and increasing the number of parents or male parents in a reference panel can increase the imputation accuracy [51][52][53] . In other words, if we can increase the number of individuals in the target panel and reference panel, the imputation accuracy will be increased. Mixing different breeds in a reference panel would thus improve imputation accuracy 54 . In this study, we mixed different pig breeds in the reference panel and executed strict quality control, such as MAF and call rate, in the target and reference panels. We achieved a high CR with an average of 90% and real genotypes of c.575T > C, which confirmed the high imputation accuracy (98.43%). Therefore, this factor may not be very critical in this study, but we also should pay more attention to exploring the factors that affect the imputation results in the future. In addition, a reassociation study using real genotypes at c.575T > C achieved a P value of only 3.89E-07, which further indicates that it was not associated with lumbar number in the Sutai population. To determine whether there are several causative mutations responsible for lumbar number, we performed a conditional test by adjusting the top SNP on SSC1 and conducted GWAS again. Additional significant signals would stand out if the multiple causative mutation hypothesis was true. In our analysis, there were no other QTLs associated with lumbar number, which means there is only one QTL that controls lumbar number on SSC1, whereas the causative mutation is not the same as that previously reported. Further functional studies, such as gene expression and site-specific editing technology, are necessary to confirm the possibility of causality for the top SNP in the Sutai population. Manhattan plots for lumbar number in the conditional association test and LD drop association test, respectively. In the Manhattan plots, the y-axis and x-axis represent the negative log 10 P values of the SNPs and the genomic positions separated by chromosomes, respectively. In Manhattan plots a and b, the black solid lines indicate the chromosome-wide significance threshold (−log10(5E-08)), and in (b), the tomato puree points represent SNPs that exceeded the chromosome-wide significance threshold.
In this study, we rediscovered the missing QTL for lumbar number in Sutai pigs using GWAS based on a whole-genome imputation strategy. This QTL includes the same potential causative gene, NR6A1, that was previously reported, while the top SNP differed from the previously reported potential causative mutation. This study illustrates the importance and effectiveness of uncovering the traits in agricultural animals using a whole-genome imputation approach and provides a solution that combines second-generation sequence data with GWAS. Our results also show that this approach can be a powerful strategy to analyze economically important complex traits in livestock. Along with developing good imputation software, exploiting more public database systems will contribute to genotype imputation in the future.