Population-genetic properties of differentiated copy number variations in cattle

While single nucleotide polymorphism (SNP) is typically the variant of choice for population genetics, copy number variation (CNV) which comprises insertion, deletion and duplication of genomic sequence, is an informative type of genetic variation. CNVs have been shown to be both common in mammals and important for understanding the relationship between genotype and phenotype. However, CNV differentiation, selection and its population genetic properties are not well understood across diverse populations. We performed a population genetics survey based on CNVs derived from the BovineHD SNP array data of eight distinct cattle breeds. We generated high resolution results that show geographical patterns of variations and genome-wide admixture proportions within and among breeds. Similar to the previous SNP-based studies, our CNV-based results displayed a strong correlation of population structure and geographical location. By conducting three pairwise comparisons among European taurine, African taurine, and indicine groups, we further identified 78 unique CNV regions that were highly differentiated, some of which might be due to selection. These CNV regions overlapped with genes involved in traits related to parasite resistance, immunity response, body size, fertility, and milk production. Our results characterize CNV diversity among cattle populations and provide a list of lineage-differentiated CNVs.

AMY1 copy number enables the better digestion of starchy foods 26 . An indel polymorphism in gene APOBEC3b has been associated with malaria susceptibility 27 . The human UGT2B17 gene shows significant copy-number diversity among populations from Africa, Europe, and East Asia, which displays region-specific differences in the metabolism of steroid hormones and a large number of xenobiotics 28 . Another well-known example is the olfactory receptor (OR) genes, which are frequently found to be copy-number variable in most mammalian species. The differences in OR gene counts between human populations suggested that they are involved in population-specific differences in smell 29 . In addition, CNVs are specifically enriched among evolutionary "young" ORs, implying that CNVs may play a critical role in the processes of gene birth and death or the emergence of new OR gene clusters 30 .
In livestock, such as cattle, most CNV studies have limited themselves to CNV detection and enumeration using various platforms, such as CGH array, SNP array or next generation sequencing [31][32][33][34][35][36][37][38][39] . Even though the aforementioned studies have identified a large number of copy number variable regions in their respective species, exploring livestock population genetics using cattle CNVs is still in its infancy. The investigation of diversity and origin of CNVs, the characterization of their population-genetic properties, and the determination of the functional impacts of CNVs are still active areas of research.
Here, we report a comprehensive population-genetics study of CNVs by focusing on the diversity, population structure, and selection of identified CNVs within eight representative cattle breeds. In this study, we investigated CNVs from individuals originating from European taurine, indicine, and African taurine breeds of the Bovine HapMap DNA panel 40 . Our results revealed that most common CNVs, especially CNV deletions, show large differences in frequency across diverse groups. More importantly, we demonstrated that CNVs can be used for the investigation of population genetics in cattle, as we observed CNVs with significant diversity across groups that might be associated with breed and sub-species specific selection signatures.

Results
CNVs segmentation and genotyping. A total of 300 individuals was used for CNV discovery as shown in Table S1, including Holstein (HOL), Angus (ANG), Hereford (HFD), Brown Swiss (BWS), Brahman (BRM), Nelore (NEL), N'Dama (NDA), and Sheko (SHK). In total, 155,700 CNV segments were extracted by Golden Helix SVS 8.0 using the default multivariate option. After merging across all individuals, we discovered 263 non-redundant CNVs which are commonly shared within the whole population (Table S2). Since the SVS multivariate option was developed to identify moderate to high frequency CNVs, only segments with frequencies above 1% were retained for further analysis in order to filter away potential false positive calls. Finally, a total of 257 CNVs (with a total length of 12,444 kb and an average length of 48.4 kb) were retained and used to categorize the samples as one of three types (loss, neutral and gain events) according to a three-state model with strict threshold levels of marker mean log R ratio (LRR) ± 0.3. They were sorted as a list of CNV1 to CNV257 with a descending frequency, in which there were 184 deletion CNVs (Table S2). As shown previously 41,42 , comparisons of CNV detection algorithms usually revealed a low concordance. However, when we compared this dataset with our previous results using PennCNV in the same Bovine Hap Map samples 32 , we obtained a total of 160 concordant CNVs (61%), indicating a high quality of our SVS results. While all 257 CNVs were used for frequency and V ST calculations, only the 184 deletion CNV regions were used in all other subsequent population genetics analyses.

Population-genetic properties of cattle CNVs.
Hierarchical Clustering Analysis. To obtain a global picture of group differences, hierarchical clustering was done using the mean LRRs for the 257 CNVs. Three distinct groups were observed, including group one European taurine (TAU) containing HOL, ANG, HFD, and BWS; a second indicine (IND) group containing BRM and NEL; and third group African taurine (AFR) containing NDA (Fig. 1). SHK, which used to be considered a taurine population, because they are humpless, was positioned between IND and AFR confirming SHK is a hybrid breed in agreement with its known breed formation history 43 .
Multidimentional Scaling (MDS) Analysis. To examine the population structure of these three cattle groups based on CNVs, a multidimensional scaling (MDS) analysis was completed on 205 unrelated individuals based on the 184 deletion CNVs ( Fig. 2A). We found C1 axis can clearly separate taurine (TAU and AFR) from indicine (IND), while C2 axis can separate African taurine (AFR) into its unique cluster with a small amount of intermixing with European taurine (TAU). Therefore, the global organization of cattle genetic diversity can be represented as a triangle with apexes corresponding respectively to TAU, IND, and AFR groups. As expected, we observed that SHK was located between AFR and IND, again confirming its hybrid breed formation history. We found this CNV-based MDS results are generally consistent with the results from a similar SNP-based analysis 40,44 , suggesting CNVs can be used to separate cattle individuals into distinct groups. However, the clustering resolution within groups based on CNVs was not better than those based on SNPs. For example, CNVs cannot distinguish the HOL breed from the ANG breed in European taurine cattle. There were also certain degrees of mixing within indicine individuals in the CNV-based clustering results ( Fig. 2A). In summary, our results revealed that CNV can be used in population genetic studies. However, compared to SNP, CNV suffers from small sampling size and difficulty to genotype, making it difficult to use them to do fine clustering, especially within a group.
Admixture Analysis. To investigate genome wide ancestral admixture patterns of these eight breeds, we used the admixture inference method implemented in STRUCTURE (Fig. 2B). Varying the number of presumed ancestral populations (K) recapitulated the extent of genetic divergences across breeds. At K = 2, TAU and AFR were clearly assigned into unique groups distinct from IND. At K = 3, the clustering analysis revealed TAU was separated from AFR showing a clear separation of TAU, IND and AFR groups. At K = 4, intriguingly, European taurine  beef breeds was separated from their dairy counterparts. At K = 5, BWS was deviated from HOL. Finally at K = 6, HFD was separated from ANG and most of the samples were clustered according to breed designation, except that the NEL and BRM breeds were still clustered together. In addition, increasing the number of inferred clusters allowed us to confirm a high level of admixture and support the documented origin of SHK, which accommodated high fractions of admixture from ancestries of AFR and IND. Overall, these results were in agreement with our MDS analysis, suggesting that the partitioning of cattle into distinct populations is closely related to genetic diversity, which is in agreement with the earlier report by Bovine HapMap Consortium 40 .
Neighbor-Joining Clustering Analysis. In addition, we calculated all pairwise genetic distances using PLINK 1.07, and plotted a neighbor-joining dendrogram of all individuals (Fig. 2C). We found that the genetic relationship among cattle groups could be largely recovered from this dendrogram as it clearly arranged individuals according to their population of origin. Although two indicine breeds (BRM and NEL) are intermixed, the three breed groups (TAU, AFR and IND) can be easily distinguished. In agreement with MDS and admixture results, individuals from SHK branched between AFR and IND. This clustering analysis of individual samples supports most of the relationships among the cattle breeds uncovered by our MDS and STRUCTURE analysis.
Population diversity and differentiated CNVs. Using 257 CNVs, we estimated the CNV frequencies  (Table S2). Using the 205 unrelated individuals, we estimated the frequencies for each group to investigate CNVs with differentiated frequencies.
We compared CNV regions with the UMD 3.1 Ensembl gene (EnsGene) annotation and found 101 CNVs partially overlapping with genes (Table S3). We performed gene ontology (GO) analysis using PANTHER 45 to identify if there was enrichment of genes with specific function (Table S4). The most enriched biological processes include response to interferon-gamma (Interferon-Induced Guanylate-Binding Protein 2), other immunity related processes, and response to stimulus (MHC, immunoglobins, ORs and ATP-binding cassette (ABC) transporters). Aside from 99 CNVs that overlapped with EnsGene genes, we found 158 CNVs which did not encompass any EnsGene genes. When comparing CNVs overlapped with genes to CNVs not overlapped with genes, we found that the CNVs without genes were shorter (with Mean 23370 ± Standard Error of Mean 5659, N = 158 vs. 88400 ± 18500, N = 99, t-test, p-value < 0.0001) and at higher frequency (0.4218 ± 0.0189, N = 158 vs. 0.3468 ± 0.0218, N = 99, p-value = 0.0113). The paucity of common deletion CNVs overlapping with genes is consistent with the notion that they are under purifying selection, which removes deleterious variants from the population. We noted that our length and frequency analysis could be confounded if the power to detect short events is higher than long events and the power to detect common deletions is higher than common duplications. Besides other neutral possibilities, like those indicated in an early human study 14 , CNVs may also affect gene expression through regulatory level changes.
CNVs that differ greatly in frequency between cattle groups/breeds are candidates for population-specific selection. To test whether any CNV might be associated with population-specific selection, we estimated the pairwise V ST for 4 comparisons, including TAU vs. IND, TAU vs. AFR, IND vs. AFR, and HOL vs. ANG (Fig. 3,  Fig. S1, and Table S3). V ST estimations produce values from 0 (no difference) to 1 (complete population differentiation), with high V ST values indicating regions under increased selective pressure or other evolutionary forces, such as bottlenecks or founder effects. Using a stringent threshold cutoff of V ST > 0.6, we detected a total of 14, 0, 18, 1 CNV(s) in the abovementioned four comparisons, respectively. When we lowered the threshold to 0.4, we observed 41, 11, 48, 2 CNVs for the four comparisons, respectively.
The higher differential V ST identified in these CNV regions may suggest the dosage variability of their underlying genomic sequence, which could be further involved in the diverse phenotypes across cattle breeds. For instance, when comparing TAU with IND under the lower threshold of 0.4, we observed ten genes overlapped with CNV regions, including CDH18, GDAP1L1, HIATL1, IGLL1, ITGB8, KCNIP3, LCT, NETO1, OIT3, and SHISA9. Similarly for the comparisons of TAU vs. AFR and IND vs. AFR, we found nine genes (EPHB3, FANCC, GRM7, HSFY2, KCNJ12, LIPF, PRAME, TSPY, and ZNF280B) and nine genes (GDAP1L1, HIATL1, LCT, MRPL48, MSMB, PLCB1, RBFOX1, ROBO4, and SHISA9) overlapping with CNV regions, respectively. Although for some genes, only small parts were covered by CNVs, the change of these small regions could potentially influence their function and evolution. We further overlapped these genes with the known cattle quantitative trait locus (QTL at http://www.animalgenome.org/cgi-bin/QTLdb/BT/index) and found genes associated with important traits in cattle that vary in copy number frequencies across populations (Fig. S2). For example, some of them are related to parasite resistance, immunity response and adaption, including EPHB3, SHISA9 and LCT 46,47 . Other genes were reported to be involved in body size, fertility, production and milk fatty acid profile, for examples FANCC, IGLL1 and LIPF 48,49 .

Discussion
Population genetics studies based on CNVs have been explored in human, dog, zebrafish, and stickleback fish 8,9,23,50 . Despite that previous studies have identified CNVs within and between populations in cattle, our study is one of the first attempts to explore the population-genetic properties in cattle based on CNVs derived from the high-density SNP array. We also provided additional evidence to support CNVs as genetic markers that can be used to study the across population diversity and capture the subspecies relationships. Since it was difficult to accurately detect and genotype complex CNV events, like non-biallelic duplications, in this proof-of-principle study, we mainly used high confidence deletion CNVs. The distinct advantage of deletion CNVs over duplication events is that deletions can be treated as bi-allelic markers, and are therefore compatible with mature genetics analysis methods designed for SNP markers. In the current study, we used CNVs with moderate intra-population frequencies to explore the population-genetic properties in cattle. Although the SVS method we utilized reported a limited amount of CNV calls, including 71.6% deletions and 28.4% duplications, our study did reveal that globally diverse cattle populations clustered roughly by geographical region and were influenced by demographic history, which is similar to the results derived from previous SNP-based cattle studies 40,51,52 . This could be due to the fact that 75% of the CNVs that we identified were well tagged by flanking SNPs, as was estimated previously 53 . We also found the results of this survey were not capable of distinguishing recently divergent cattle breeds of common geographic origin, such as HOL and ANG breeds, which were reported to be separated by 364 generations 54 . This observation was consistent with human studies, where published reports also showed the CNV-based population stratification can be only detected among the large population groups at the continent level 15 . The fine genetic structure detected by SNPs may be due to their accuracy of genotyping and sample size, besides other influencing factors, as suggested previously 16 . Further study based on high throughput sequencing will make it easier to accurately genotype CNVs 11,34 .
To investigate lineage-differentiated CNVs in the cattle genome, we also conducted CNV-based population differentiation analysis, and identified potential CNV candidates under divergent selection. We estimated V ST values, a population differentiation estimator similar to F ST , among groups and examined gene enrichments among CNVs regions. In our previous study based on array CGH in cattle populations, we revealed that regions that have been under recent positive selection exhibit elevated population differentiation 33 . In the current study, we found 78 unique CNVs with V ST values above the threshold of 0.4 as potential lineage-differentiated events in three group comparisons, perhaps representing increased selective pressures exerted upon the cattle population. It is noted that besides selective pressure, the amount of divergence between populations (time since divergence, effective population size, and gene flow/migration) also can affect the overall differentiation and V ST values. High V ST between groups does not necessarily involve divergent selection (selection in both populations for different alleles), and can also occur in the absence of selection, for example, by bottlenecks or founder effects followed by drift. All these hypotheses warrant further investigations using larger sample sizes.
By contrast, we observed only a handful of lineage-differentiated CNVs in our HOL vs. ANG comparison, which did not overlap with any known cattle genes. Fewer lineage-differentiated CNVs may suggest that these two breeds might not have had sufficient time to diversify their deletion CNV contents, if we assume that the CNV occurred before the split of the two breeds, and that the deletion event should eventually fix in one but not the other population. Additionally, fewer lineage-differentiated CNVs were also observed in laboratory mouse and zebrafish strains, as compared with their wild populations 23,55 . This may be indicative of either inbreeding effects or suggest that CNVs were preferentially fixed as a consequence of the larger effective population size of wild populations. It is well known that HOL and ANG have gone through intensive human selection recently via the practice of artificial insemination. In the future, we propose that additional cattle breeds from places like the Middle East, Pakistan and Turkey will provide more insights into the worldwide and local genetic diversity and population structure. We also expect that discovery of low frequency CNV variants, especially in those under-represented breeds like indicine and African taurine cattle will provide additional resolution for distinguishing those populations. Finally, with more powerful software tools 56 , we predict population genetics in livestock will remarkably expand with next generation sequencing data.

Samples.
In the CNV discovery phase, we retrieved a subset of Illumina BovineHD SNP dataset (300 individuals, Table S1), which represent 8 geographically diverse breeds, including Holstein, Angus, Hereford, Brown Swiss, N'Dama, Sheko, Brahman, and Nelore 32 . All chosen samples had a genotyping success rate of more than 99%. For population genetic analyses, we only used 205 animals after removing related individuals according to pedigree information and pi-hat value if it was more than 0.4.
CNV segmentation and genotyping. The intensity data of 742,910 SNP probes were generated using the Illumina BovineHD SNP array. After exporting the DSF file from GenomeStudio Software, we imported Log R Ratio (LRR) into Golden Helix SNP & Variation Suite (SVS) 8.0 (Golden Helix Inc., Bozeman, MT, USA) and successfully mapped 735,293 SNPs (98.97%) onto the 29 autosomes of Bos taurus genome assembly UMD 3.1. The LRR was then normalized using the default GC correlation file to correct the waviness caused by the GC content. We then utilized the copy number analysis module (CNAM) under the multivariate option to segment chromosomes with default set, and a significance level of p = 0.01 for pairwise permutations (n = 1,000) as described previously 53 . The three state covariates with a comparatively strict threshold (segment mean 0.3) was used to genotype the CNVs as one of three type (loss, neutral and gain events) across all the samples. It was noted that the multivariate method tends to detect the common deletions with relative small sizes across multiple samples.
Population differences across population. We first checked the normalized LRR distribution histograms for all 184 deletion CNVs. The grand majority (99%) of deletions had two distinct peaks, representing neutral (around 0) and homozygous deletion (around -1) states, respectively. Only a couple of deletions had a few samples located in the midpoint between -1 and 0, suggesting a lack of heterozygous events. Additionally, it was difficult to define a universal threshold between homo or heterozygous deletions for all deletions, therefore, we decided to categorize all deletion events using one state: homozygous deletion. To use population genetic programs originally developed for SNPs, we manually recoded each 184 deletion CNVs by converting a loss event into "12" or a neutral event into "22", where "12" represented a homozygous deletion.
The R Function heatmap.2 (http://www.inside-r.org/packages/cran/gplots/docs/heatmap.2) was used to graph the segment mean LRR values and generate hierarchical cluster dendrograms using 257 CNVs for all animals. We then performed multidimensional scaling (MDS) and admixture analysis to determine how 205 unrelated individuals were clustered according to these CNV genotypes. Using a total of 184 deletion CNVs, MDS analysis of pairwise genetic distance (4 dimensions) was used to detect the relationship between populations with PLINK 1.07 (-mds -plot 4). For a separate verification, we also performed the cluster analysis based on mean LRR values using prcomp functions in R v13.1, the results were consistent with MDS analysis.
Population structure was examined using STRUCTURE 2.3 57,58 . Each admixture analysis was performed using 5,000 replicates and 2,000 burn-in cycles under admixture and allele frequencies correlated models.

Gene annotation and PANTHER analysis.
We retrieved RefSeq, Ensembl, and xenoRefSeq genes overlapping CNV regions by at least 1 bp, including the 5′ and 3′ untranslated regions, from available UCSC genome browser tracks and annotated CNV regions using custom software (https://github.com/njdbickhart/ AnnotateUsingGenomicInfo). We performed enrichment analysis using PANTHER classification system 45 . Only clusters with enrichment scores more than 1 (p-value < 0.05 after the Bonferroni correction for multiple testing) were considered.

Signatures of Selection.
To detect the lineage differentiated CNV events, we calculated V ST for each CNV as previously described 15 by using the following equation: (V T − V S )/V T , where V T is the total variance in mean LRRs across all individuals and V S is the average variance in cattle within each breed.