Whole genome sequence analysis reveals genetic structure and X-chromosome haplotype structure in indigenous Chinese pigs

Chinese indigenous pigs exhibit considerable phenotypic diversity, but their population structure and the genetic basis of agriculturally important traits need further exploration. Here, we sequenced the whole genomes of 24 individual pigs representing 22 breeds distributed throughout China. For comparison with European and commercial breeds (one pig per breed), we included seven published pig genomes with our new genomes for analyses. Our results showed that breeds grouped together based on morphological classifications are not necessarily more genetically similar to each other than to breeds from other groups. We found that genetic material from European pigs likely introgressed into five Chinese breeds. We have identified two new subpopulations of domestic pigs that encompass morphology-based criteria in China. The Southern Chinese subpopulation comprises the classical South Chinese Type and part of the Central China Type. In contrast, the Northern Chinese subpopulation comprises the North China Type, the Lower Yangtze River Basin Type, the Southwest Type, the Plateau Type, and the remainder of the Central China Type. Eight haplotypes and two recombination sites were identified within a conserved 40.09 Mb linkage-disequilibrium (LD) block on the X chromosome. Potential candidate genes (LEPR, FANCC, COL1A1, and PCCA) influencing body size were identified. Our findings provide insights into the phylogeny of Chinese indigenous pig breeds and benefit gene mining efforts to improve major economic traits.

identifying selective sweeps during domestication 14,15 . Such research included genome-wide analyses of domestic breeds (e.g., Tibetan 14 , Tongcheng 10 , Enshi Black 13 , and Rongchang 16 ) with a focus on tolerance to harsh environments, high fertility, and body size. Currently, too few Chinese pig breeds have been studied to provide a conclusive investigation of porcine evolution in China. Specific loci and genes underlying common phenotypic variation among Chinese domestic pig breeds have not yet been studied.
To address these deficiencies, we performed whole-genome resequencing of pigs representing 22 breeds distributed across different geographical areas in China. This new sequence data was integrated with publically-available sequence data from seven other pig breeds, including European breeds. We uncovered population genetic structures among Chinese indigenous pigs, genetic introgression between population pairs (North China, South China, and Europe), LD patterns of X-chromosome, along with potential candidate genes associated with body size.
To these data, we included genomic data 12,17 publically available for seven pigs of wild and commercial European and Chinese breeds ( Table 1). The combined dataset had 14.09 billion high-quality raw reads (1,281.12 Gb raw bases, >90% Q30 bases) ( Supplementary Fig. S3).
A strict quality-filter pipeline resulted in 19,685,697 single-nucleotide polymorphisms (SNPs) from 31 pigs (Supplementary Table S3). Of these SNPs, 13,430,360 (68.22%) were in intergenic regions, 1,223,834 (6.22%) were 5-Kb upstream or downstream of gene regions, and 5,031,503 (25.56%) were within gene regions. The last group contained 46,618 non-synonymous (NS) and 53,028 synonymous (S) SNPs ( Supplementary Fig. S4), leading to an NS/S ratio (ω) of 0.88, which is higher than the ratio of 0.68 reported by Li et al. 14 . This study collected more local pig breeds in China than Li et al. 14 , resulting in a higher NS/S ratio. In this study, 20 Chinese domestic pig breeds covering the whole country were collected. In the study of Li et al. 14 , although the number of individuals reached 45, only Tibetan pigs and five other local Chinese breeds distributed in Sichuan and Chongqing were collected.
Homozygous (Hom) and heterozygous (Het) SNPs were classified per individual. Homozygous SNPs were more common in all European pigs than in Chinese pigs, especially in two European wild boars that had Hom/ Het SNP ratios of 3.804-4.460 (Supplementary Table S3). Furthermore, except for the Large White (LW) pig, higher Hom/Het ratios of indels were observed in European pigs than in Chinese pigs, which was consistent with that of SNP variants (Supplementary Table S4). These results suggest that population bottlenecks may be responsible for the reduced genetic diversity observed in European pigs compared with Chinese pigs 17 . Additionally, numerous specific alleles appear to have been fixed in European and Chinese populations after separation. population structure and introgression. We constructed a non-rooted phylogenetic tree based on 9.2 million population SNPs ( Fig. 1a and Supplementary Fig. S6) to understand the genetic relationships and structure among Chinese pigs with different geographical distributions. The estimated phylogeny revealed that the primary division was between European and Chinese pigs, European wild boars clustered with European domestic pigs, and Chinese wild boars clustered with Chinese domestic pigs, consistent with previous studies 14,17 . Our results lend further support to the viewpoint that pig domestication occurred independently in western Eurasia and East Asia. Moreover, Chinese domestic breeds split on geographical grounds, namely into South and North China (CnSouth and CnNorth) subpopulations. The former encompassed all individuals from the classical South Chinese Type and some of the Central China Type. The latter comprised the remainder of Central China Type and all those from the remaining four types (North China, Lower Yangtze River Basin, Southwest, and Plateau) (Fig. 1a). The genetic relationships among Chinese indigenous pig breeds were remarkably congruent with geographic distribution. Dahua Bai (DH: Xingning City, Guangdong Province, South China) clustered with South Chinese Type breeds and Jinhua (JH: Jinhua City, Zhejiang Province, Yangtze River lower reaches) clustered with Lower Yangtze River Basin Type breeds ( Fig. 1a and Table 1). Notably, DH and JH are considered to be of the Central China Type, a consideration based on coat color phenotypes 4 . The reference genome selected in this study was also from inbred Wuzhishan pig, which belonged to the same inbred population as WZSI used in this study. After nearly 20 generations of inbreeding, the inbred line has formed distinct genetic differentiation with other local Chinese pigs, leading to a separate cluster, including the reference genome sample and WZSI at K = 3 (Fig. 1c).
Principle component analysis (PCA) confirmed the phylogenetic analysis ( Fig. 1b and Supplementary  Table S6). Furthermore, a model-based clustering analysis with proportional contributions from five ancestral populations revealed the same subpopulations (CnNorth and CnSouth). Northern Chinese pigs could be further split into two subgroups (Fig. 1c): Subgroup 1 consisted of the Lower Yangtze River Basin and North China types, and Subgroup 2 comprised the Southwest and Plateau types. Features of genetic structure (Fig. 1c) and geographical distribution ( Supplementary Fig. S1) confirmed the three East-Asian centers of pig domestication identified initially through mitochondrial DNA. These centers are the Mekong region 18 , middle and downstream regions of www.nature.com/scientificreports www.nature.com/scientificreports/ the Yangtze River 19,20 , and Tibetan highlands 18,20 . Thus, our study provides evidence that the classical classification scheme 4,21 requires updating with genetic information.
Our three analyses of population structure (phylogeny, PCA, and clustering analysis) (Fig. 1a-c) revealed that admixture likely took place in six Chinese indigenous breeds. Therefore, we employed the haplotype sharing ratio to examine putative introgression across all pairs of four populations (South China, North China, Europe, and admixed, including domestic and wild pigs) corresponding to our model-based clusters (Fig. 1c). We examined nucleotide variation (θπ and θw) to measure genetic diversity across three populations (wild pigs, European domestic pigs, and Chinese domestic pigs) and the two Chinese subpopulations (CnNorth and In comparison with wild and Chinese domestic pigs, European domestic pigs have a lower level of genetic diversity (θw/Kb:2.01, θπ/Kb: 2.12). We then calculated the divergence index (F ST ) to measure population differentiation between the different domestic pigs and wild pigs and between the two subpopulations ( Supplementary Fig. S8). The highest F ST (0.08) was observed between European domestic pigs and wild pigs. The LD decay rate was measured by the average distance over which the LD coefficient (r 2 ) falls to half of its maximum value ( Supplementary Fig. S9). The LD decay rate of European domestic pigs (~27.60 kb, r 2 0.5 = 0.33) was lower than that of the other two populations (wild pigs: ~7.30 kb, r 2 0.5 = 0.25; and Chinese domestic pigs: ~6.00 kb, r 2 0.5 = 0.27), which might be a result of the low genetic diversity in European domestic pigs. Taken together, our results from genetic diversity and LD decay in European domestic pigs support the hypotheses of expansion from a relatively small ancestral population 14,17 and a large reduction of effective population size under intensive breeding 27 .
The bottleneck effect can greatly change the allele frequency of sites in the population, which is the main reason for the drastic change of LD in a short time 28 . In our study, within a short LD decayed distance (<30 Kb), wild pigs had lower r 2 than Chinese pigs. However, higher r 2 at a longer distance (≥30 Kb), suggests that the ancestral population from wild boars was larger than that from Chinese domestic pigs, but wild boars were subjected to narrow bottlenecks. The similar signatures of narrow bottlenecks within LD patterns have also been reported from different cattle populations 24 . Finally, CnNorth and CnSouth exhibited low population differentiation (F ST = 0.06) and similar nucleotide diversity and LD decay rate (Supplementary Table S7  Characterization of a large-scale LD block in the X chromosome. Using SNP data, we identified a large-scale LD block (40.09 Mb, 44,595,487-84,684,295 bp) (Fig. 2) in the X chromosomes of all 31 pigs. This region was previously shown to have an extremely low recombination rate (48 Mb segment, 44.0-91.5 Mb) 15,29 , and spanned the centromeric region (47.3-49.2 Mb). We observed three major haplotypes after selecting SNP markers with inter-marker distances of 3 Kb. Haplotype S was unique to domestic and wild pigs of southern China, whereas N was present in northern Chinese wild pigs, European domestic pigs, and European wild pigs. The third was a recombinant haplotype set that included six haplotypes (N-S-1 to N-S-6) found only in northern www.nature.com/scientificreports www.nature.com/scientificreports/ Chinese domestic pigs (Fig. 2). These LD patterns indicate that northern Chinese domestic pigs exhibit more haplotype diversity and they corroborate previous findings of a 14 Mb X-linked sweep region 12,15 .
We then used all SNP markers from the LD block to detect intervals of local breakdown in LD in the haplotype set. We identified two intervals of reduced recombination: interval 1 (left) at 46, 219, 219-46, 419, 569 bp and interval 2 (right) at 56, 819, 762-57, 752, 631 bp. The minimum distance between the two intervals was a 10.40 Mb segment (46, 419, 569-56, 819, 762 bp) (Fig. 3), a highly conserved portion of haplotype N in northern Chinese domestic pigs. Moreover, the 10.40 Mb segment is located inside the 14 Mb X-linked sweep 15 . Overall, we found more haplotypes (n = 8) within the 40.09 Mb LD block and a shorter conserved region (10.40 Mb) than described in the previous reports 12,15,29 , which were likely due to our use of high-density genetic markers from data with high sequencing depths and from obtaining a greater number of Chinese pig breeds.
The 40.09 Mb LD block contained 189 annotated genes, 143 (75.66%) and 108 (57.14%) of which contained SNPs and nonsynonymous substitutions, respectively. KEGG analysis mapped these 189 genes onto the Shigellosis and Neurotrophin-signaling pathway (Supplementary Tables S8 and S9). Of the 374X-chromosome QTLs in the Pig Quantitative Trait Locus database (Pig QTLdb), we aligned 247 (66.04%) to the Wuzhishan pig genome. Furthermore, 47X-chromosome QTLs overlapped with the 40.09 Mb LD block. Thirty-seven (37/47, 78.72%) and seven (7/47, 14.89%) QTLs were related to meat and carcass quality and reproduction, respectively (Supplementary Table S10). Within the meat and carcass quality associated QTLs, 26 (26/37, 70.27%) were related to fat traits (3 fat composition and 23 fatness QTLs), consistent with lipid-metabolism QTLs identified near the X-chromosome centromere 30 . Trait hierarchies for reproduction associated QTLs from the Pig QTLdb are divided into four categories: endocrine, litter traits, reproductive organs, and reproductive traits. In this study, the seven overlapping QTLs associated with reproduction traits were assigned to the reproductive organs, reflecting between-subpopulation (CnNorth, CnSouth, and European) differences in reproductive characters.
Across CnNorth and CnSouth pigs, we identified 4,169 population-level indels in CDS regions of functional genes. After filtering out markers that covered samples less than 5 in one group to meet the minimum requirement of an expected value of chi-square statistics, 2,711 indels remained. Six differed significantly between the two subpopulations, and five of these were distributed in three gene loci (ENSSSCG00000012830, HUWE1, and ITIH5L) in the 10.40 Mb conserved region (Supplementary Table S11). The first locus contained three indels that were matched against the InterPro database to reveal two specific cold-shock protein domains (IPR002059 and IPR011129). Variants of these genes in the CnNorth pigs were also found in northern Chinese wild pigs and European domestic and wild pigs. www.nature.com/scientificreports www.nature.com/scientificreports/ We next selected the top 100 SVs out of 64,876 population-level SVs that exhibited significantly non-random distribution (χ 2 test with FDR correction, P < 0.01). Thirty-four of these SVs were located in the X chromosome (Supplementary Table S12  www.nature.com/scientificreports www.nature.com/scientificreports/ Identification of candidate genes for body size. Our sample was split into small pigs (adult body length ≤100 cm, height ≤50 cm; N = 7) and large pigs (adult body length ≥120 cm, height ≥65 cm; N = 7), based on early phenotype characterization records 21 and our own measurements (Supplementary Table S14). We then identified 115 nonsynonymous substitutions, distributed in 95 gene regions, that differed in allele frequency between large versus small pigs (>80% in one group, approaching fixation; <20% in the other) (Supplementary  Table S15). These nonsynonymous substitutions were putative candidate polymorphisms that resulted in size differences. Indeed, two genes (LEPR and FANCC) overlapping with nonsynonymous substitutions are reported as associated with body growth and development in some mammals 33,34 . In humans, impaired LEPR function exerts a strong negative effect on ponderal index at birth and height in adolescence 34 . Likewise, FANCC plays a major role in skeletal formation, and thus affects human height 35,36 .
We then analyzed differences (χ 2 -test with Bonferroni's correction) in frequency of indels and SVs between large and small pigs, to understand their effects on body size. We found significant (P < 0.05) between-size-group differences for 10 indels and 20 SVs, located within 7 and 10 functional genes, respectively (Supplementary Tables S16 and S17). For all the seven small pigs, we identified a 4 bp insertion in the third exon of COL1A1. COL1A1 is an α1(I) protein chain of type I collagen and a major structural component of bone. Nonfunctional COL1A1 markedly reduces skeletal mineral density and body height 37,38 . We also found a 430 bp deletion in the third intron of the gene encoding propionyl CoA caboxylase α subunit (PCCA). A genetic defect in PCCA causes propionic acidemia, a condition that can lead to bone disease and growth failure 39 .

Materials and Methods
Samples. All Supplementary Table S1. We also included samples from southern and northern Chinese wild pigs (n = 4), as well as European wild and commercial pigs (n = 7) ( Table 1 and Supplementary Fig. S1). Altogether, data from 31 individual animals were used in this study: (i) 24 sampled from 22 breeds, which were handled by the South China Agricultural University, Guangzhou, People's Republic of China (Table 1 and Supplementary Fig. S1) and (ii) seven (one pig per breed) downloaded from the Wageningen University Porcine Re-sequencing Phase 1 Project (http://www.ebi.ac.uk/ena/data/view/ERP001813) 12,17 with the highest sequencing depths to supplement the breeds sampled here (Table 1). Seven small pigs and seven large pigs were used to detect candidate genes for body size (Supplementary Table S14). Body size data were obtained for 14 pigs, 11 from the book Animal genetic resources in China: pigs 21 , and three were measured according to the technical specifications for the registration of breeding pigs (NY/T 2004). A completed ARRIVE guidelines checklist is included in Table 1.
DNA isolation and genome sequencing. Genomic DNA was extracted from ear tissue of live collection using a phenol-chloroform-based method. For each sample, 1-15 µg of DNA was sheared into 200-800 bp fragments using the Covaris system (Life Technologies). Fragments were then treated according to the Illumina DNA-sample-preparation protocol. For library construction, fragments were end-repaired, A-tailed, ligated to paired-end adaptors, and PCR-amplified with 500 bp inserts. Sequencing was performed to generate 100 bp paired-end reads on the HiSeq 2000 platform (Illumina), following the manufacturer's protocol.
Sequence alignment and genotype calling. Filtered reads were aligned to the Wuzhishan pig draft genome assembly (minipig_v1.0) 40 using the Burrows-Wheeler Aligner 41 . This genome was selected as the reference 7,40 after considering the geographical distance and genetic divergence among the 31 breeds (Table 1 and Supplementary Fig. S1 and Table S1).
Aligned bam files were sorted and indexed in Picard-tools version 1.117. Two GATK (Genome Analysis Toolkit version 2.4-9 42 modules, RealignerTargetCreator and IndelRealigner), were used to realign the SNPs around indels in bam results. To obtain high-quality variants, additional GATK modules HaplotypeCaller and SAMtools 43 were used to call variants for each sample. Only concordance variants were selected, and SNPs were filtered with the parameter "QD < 2.0 | | FS > 30.0 | | MQ < 40.0 | | DP < 6 | | DP > XXX | | ReadPosRankSum < -8.0 | | BaseQRankSum < -8, " while indels were filtered with "QD < 2.0 | | FS > 30.0 | | ReadPosRankSum < -8.0. " These variants were used to perform base quality score recalibration (BQSR), and resultant reads were applied calling population variants, done with the GATK HaplotypeCaller module using the parameter "-genotyping_ mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30. " To detect structural variants, we followed an existing method 44 , with some modifications. Reads were assembled into contigs and scaffolds using default parameters in SOAPdenovo. The assembled scaffold was mapped to the reference genome in BLAT 45 , with the -fastmap option.
Criteria for determining the most well-aligned scaffold included coverage length in a given region and high contig support. Selected scaffolds and reference-genome regions with the highest alignment were extracted and aligned to each other in LASTZ (http://www.bx.psu.edu/miller_lab/). Unmapped scaffolds were further aligned against the reference genome using BLASTn. Structural variants were extracted based on all aligned regions. phylogenetic and population genetic analyses. Genetic structure was inferred from high-density SNP data in FRAPPE 46 , a program that applies maximum likelihood and expectation-maximization to estimate individual ancestry and admixture proportions. To explore individual convergence, we predefined the number of (2020) 10:9433 | https://doi.org/10.1038/s41598-020-66061-2 www.nature.com/scientificreports www.nature.com/scientificreports/ genetic clusters from K = 2 to K = 5. The maximum iteration of the expectation-maximization algorithm was set to 10,000.
A phylogenetic tree was generated from population-level SNPs in TreeBeST (http://treesoft.sourceforge. net/treebest.shtml), under the p-distances model. Population-level SNPs were then subjected to PCA in EIGENSOFT 47 , and eigenvectors were obtained using the R (https://www.r-project.org/) function eigen.
Gene and QTL annotation. Pathway analyses of candidate genes were performed using KEGG (https:// www.genome.jp/kegg/pathway.html). KEGG analysis is mainly performed by the following three steps: 1) Extract the nucleoside and protein sequences of the target gene, 2) Align the protein sequences to the KEGG animal database with the alignment software BLAST3, 3) Classify each gene according to the annotation information. Additionally, identified QTLs were functionally characterized using Pig QTLdb (https://www.animalgenome. org/cgibin/QTLdb/SS/index, Release 23, Apr 21, 2014), specifically with coordinate conversion of the Wuzhishan genome to the European-Duroc reference genome (Sscrofa10.2). Indels were matched to the InterPro database using EBI InterProScan (https://www.ebi.ac.uk/interpro/search/sequence-search). introgression analysis. Methods described in a published study 50 were used. We applied a likelihood ratio test to study the ancestral contribution of groups to the genome of each individual pig. All putative introgressions between group pairs (North China, South China, and Europe) were examined. For every 100 Kb window containing at least 10 SNPs and when at least three comparisons were possible per group, we calculated the ratio of the average sharing per pig with its own and another group. Regions with an average sharing ratio of <0.8 were defined as introgressions. Shared introgression frequency was plotted and tabulated. Introgression length and number per pig were also tabulated. Regions of extensive haplotype sharing (≥90% shared SNPs) were considered introgressed regions for each group pair.

Data availability
The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.