Linkage disequilibrium in Brazilian Santa Inês breed, Ovis aries

For genomic selection to be successful, there must be sufficient linkage disequilibrium between the markers and the causal mutations. The objectives of this study were to evaluate the extent of LD in ovine using the Santa Inês breed and to infer the minimum number of markers required to reach reasonable prediction accuracy. In total, 38,168 SNPs and 395 samples were used. The mean LD between adjacent marker pairs measured by r2 and |D′| were 0.166 and 0.617, respectively. LD values between adjacent marker pairs ranged from 0.135 to 0.194 and from 0.568 to 0.650 for r2 for |D′| across all chromosomes. The average r2 between all pairwise SNPs on each chromosome was 0.018. SNPs separated by between 0.10 to 0.20 Mb had an estimated average r2 equal to 0.1033. The identified haplotype blocks consisted of 2 to 21 markers. Moreover, estimates of average coefficients of inbreeding and effective population size were 0.04 and 96, respectively. LD estimated in this study was lower than that reported in other species and was characterized by short haplotype blocks. Our results suggest that the use of a higher density SNP panel is recommended for the implementation of genomic selection in the Santa Inês breed.


Results and Discussion
Descriptive statistics. After quality control (QC), 38,168 autosomal SNPs remained comprising approximately 53% of the entire panel. The SNPs retained after QC spanned a total of 299.63 megabases (Mb) of the genome, with a mean (standard deviation) distance between adjacent SNP of 0.07 (0.075) Mb. This value was close to that obtained by Liu et al. in Spanish Churra sheep (0.06 Mb) 14 . SNPs were evenly distributed throughout the genome as the distances between adjacent markers ranged from 0.064 to 0.085 Mb. The chromosomes differ in size and SNP quantity, with chromosome 24 being the smallest in size -OAR24 (44.21 Mb). Liu et al. 14 observed a similar behavior considering the same SNP panel (OAR24-44.85 Mb), with OAR24 being the smallest chromosome (44.85 Mb) whereas the OAR2 was the largest (263.11 Mb). The number of SNPs per chromosome was proportional to the size of each chromosome. Descriptive statistics of the SNP and LD (r 2 and |D′|) for each chromosome are presented in Table 1.
In addition, 35% of the SNPs (18,716) had minor allele frequency (MAF) lower than 0.20, with a mean MAF over all SNPs of 0.35. According to another sheep study, 33% of the SNPs had MAF lower than 0.20 22 . Extending our comparison to other species, the mean MAF was relatively higher than those found for Bos taurus indicus, with values ranging from 0.19 to 0.25 23,24 . The MAF is important because LD, independent of the metric used, is a function of allelic frequency. In general, low MAF may correspond to a larger difference in allele frequency of coupled alleles, which can result in lower estimates of LD as measured by either r 2 or |D′| 25 . Consequently, applying QC and the choice of QC criteria can affect the distribution and extent of LD 6 . Inbreeding coefficient and effective population size. For a better understanding of the population described in this study, inbreeding coefficient (F) and effective population size (N e ) were estimated for all chromosomes together and for each chromosome separately, using genomic information. The estimate of F was 0.04, a relatively low coefficient for a population that originated from the same commercial herd. Using pedigree information to estimate the inbreeding coefficient, Pedrosa et al. found values equal to 0.02 in the Santa Inês breed 26 . Al-Mamun et al. found average inbreeding coefficients for Merino, Border Leicester and Poll Dorset equal to −0.013, 0.09 and 0.02, respectively 13 . A recently published study in ovine found average inbreeding coefficients based on excess of homozygosity (standard deviation-SD) of −0.008 (0.031), ranging from −0.079 to 0.301 12 . Compared with Kijas et al. 11 and Liu et al. 14 , the F estimated from the Santa Inês breed was lower. Negative inbreeding coefficients occur when the number of observed homozygous loci is lower than the expected, suggesting that the population is more heterogeneous than expected, perhaps due to the composite nature of the breed.
In the N e estimation process, genetic distance between markers was estimated by a fixed ratio across the whole genome of one Mb per centiMorgan (cM). Prieur et al. evaluated three different methods to transform the genetic distance in ovine, and concluded that the estimation process using CRIMAP software (v2.503) was more accurate 27 . However, Prieur et al. also verified that the ranking for r 2 and N e between breeds were not affected by the method used and mentioned that the LD estimator was not different between methods 27 .
The N e estimated herein was 96 in the current generation. Kijas et al. 15 observed N e equal to 520 in the Brazilian Santa Inês breed, however, in their study only 47 animals were used. Pedrosa et al. also estimated N e using pedigree information and found a relatively low value (76) in Santa Inês 26 . These differences in N e can be due to the number of animals used (395 vs. 47 vs. 17,097) and the source of relationship information (genomics vs. pedigree). Al-Mamun et al. found values of N e ranging from 140 (Border Leicester breed) to 348 (Merino breed) 13 . Brito et al. 12 found values of N e in the most current generations in multi-breed sheep populations ranging from 125 to 974. Using a Spanish Churra sheep population, García-Gámez et al. 28 and Chitneedi et al. 29 estimated N e equal to 159 and 83, respectively.
The presence of artificial selection in the population under study was verified through the reduction of N e over the generations. In this study, N e ranged from 1,705 to 28,191 between 16 and 296 generations, respectively, before the current generation. Mastrangelo et al. estimated the N e at 295 generations ago to be 747 animals in Barbaresca sheep 30 . Liu et al. observed N e equal to 4,472 and 160 at 2,000 and 5 generations ago, assuming that one Mb is equivalent to one cM 14 . Brito et al. 12 reported estimates of effective population size of 5,537 animals 1,000 generations ago to 687 in the most recent generation. We hypothesize that the large difference in N e between the current and historic generations could be because the breeds that comprise the composite breed of Santa Inês were divergent historically and, thus, these estimates include multiple divergent breeds. The Santa Inês breed is relatively new, having only begun in the 1950s by non-systematic crossing of the Brazilian Somali, Bergamasca and Morada Nova breeds 31 . This illustrates that the large estimates of historic N e reflect time points before the formation of the breed, and even before the domestication of ovine.
We also estimated the N e for each chromosome. Chromosome 6, OAR6, exhibited the smallest N e , which was in contrast to the results of Liu et al. that reported the smallest N e for OAR10 14 . Linkage disequilibrium analysis between adjacent SNPs. The average (SD) r 2 and |D′| values esti-  13 . In the Barbaresca sheep breed, the mean r 2 across autosomes was 0.215, with an average distance between adjacent SNP pairs of 0.063 Mb 30 .
A study published with multi-breed sheep reported mean (SD) r 2 of 0.26 (0.100) 12 . The estimates of r 2 are relatively consistent across sheep populations, with the exception of larger r 2 values reported by Brito et al. Nevertheless, we should consider that the distance between markers was much shorter in Brito et al. than herein (4.74 kb versus 70 kb in the present study), which can be one reason for the increase in r 2 . Additionally, Brito et al. reported LD levels less than 0.10 for SNP located more than 0.04 Mb apart 12 . A recent study from Michailidou et al. observed a mean r 2 equal to 0.121, 0.098, and 0.092 in Boutsko, Chios, and Karagouniko, respectively, with the average intermarker distance 0.27 Mb for all breeds 34 .
Sheep populations have been associated with lower levels of LD in comparison to other ruminant and nonruminant species. Although the comparison between species is difficult due differences in genome size as well as the quality control applied, mean values between adjacent SNPs of 0.32 (r 2 ) and 0.69 (|D′|) were estimated from the  35 . The average LD (SD) between adjacent SNP within the same chromosome ranged from 0.135 (0.1972) to 0.194 (0.2423) for r 2 and 0.568 (0.3391) to 0.650 (0.3368) for |D′| (Table 1). Chromosomes 6,11,12,14,17,20,21,23 and 24 had lower average LD using r 2 lower than the 0.16 threshold 24 . Considering r 2 metrics between adjacent SNPs, chromosomes 2, 10 and 16 had higher levels of LD compared to other chromosomes. The high level of LD present on OAR10 was similar to that observed by Al-Mamun et al. 13 .
Linkage disequilibrium analysis among all pairwise SNPs. The average (SD) for r 2 and |D′| estimated between all pairwise SNPs on the 26 autosomal chromosomes were 0.018 (0.032) and 0.225 (0.213), respectively. In a study which used microsatellite markers to evaluate LD using chromosomes 1-10 of domestic sheep (Ovis aries) with mean distance between markers ranging from 10 to 40 Mb, a mean (SD) value of 0.211 (0.004) for |D′| was estimated 10 4 . Considering the confidence interval obtained for the estimates presented in this study as well as in the studies previously reported, it is possible to assume that estimates of r 2 and |D′| across all SNP combinations on a chromosome are relatively consistent across sheep populations. Figures 1 and 2 illustrate r 2 and |D′|, respectively, as a function of the intermarker distance for chromosomes 1 and 24. Supplementary Fig. S1 and S2 depict r 2 and |D′|, respectively, for the other chromosomes. Overall, the relationship between LD and intermarker distance suggest that as intermarker distance decreases, LD increases. A notable exception is chromosome 1. On this chromosome, r 2 presented secondary high peaks around the interval from 100 to 150 Mb (Fig. 1). On all chromosomes, |D′| maximum was observed between many SNP pairs with  high intermarker distances (Fig. 2). We contend that this might occur due to the dependence of |D′| on allele frequency. The unexpected increase in LD between some SNP pairs with larger intermarker distances could also be explained by selection. It is possible that favorable alleles for different traits were selected, resulting in a high degree of LD on longer intermarker distances, even extending to inter chromosome pairs of SNP. Another potential reason for high r 2 values when intermarker distance was large is assembling errors, potentially explaining the phenomenon on chromosome 1.
The average (SD) r 2 between all pairwise SNPs contained on the same chromosome with intermarker distance greater than or equal to 0. 10  Using LD categories defined by Espigolan et al., Table 2 shows the average intermarker distances between pairwise SNPs exhibiting low LD (r 2 ≤ 0.16), medium LD (0.16 < r 2 < 0.70), and high LD (r 2 > 0.70) 24 . Higher levels of r 2 (greater than 0.70) were found at distances between markers smaller than 0.768 Mb with 3,296 combinations of SNPs (0.01% of all combinations). For medium levels of r 2 (0.16 to 0.70), distances lower than 5.277 Mb were observed with 273,659 combinations of SNPs (0.849%). Considering low levels of r 2 (lower than 0.16) distances found were higher than 15.110 Mb with 31,939,376 combinations of SNPs (99.140%).
Relationship between linkage disequilibrium, inbreeding coefficient and effective population size. The relationships between r 2 , |D′|, MAF, F, and N e are reported in Table 1. The mean MAF was similar across all chromosomes. The correlation between the two measures of LD was 0.75 when LD was estimated between adjacent SNP and 0.97 when estimated among all pairwise SNP. Although |D′| tends to overestimate LD values compared to r 2 as reported by Zhao et al. 37 , both LD metrics exhibited the same behavior (Table 1). This is expected since these metrics are defined similarly as a function of allele frequency. The differences between the two metrics (r 2 and |D′|) are related to the weight applied to the allele frequencies. Given |D′| is entirely dependent on the frequency of the alleles, |D′| possibly inflates LD estimates 37 . On the other hand, the r 2 proposed by Hill and Robertson 7 aims to reduce this frequency dependence.

High
Medium According to Hill and Robertson 7 , LD (numerator of r 2 ) and F have a linear relationship as shown in the equation below 7 . In a population under selection, the number of homozygotes tends to increase for many favorable alleles. Consequently, the inbreeding coefficient and LD between these selected alleles increase 7 .
and is the numerator of r 2 , ρ A is the probability of allele A at marker 1, ρ B is the probability of allele B at marker 2, and ρ AB is a probability of the pair of AB markers; p 0 and q 0 are the frequency of A and B alleles, respectively, in generation zero or with initial equilibrium. A positive relationship (0.22) was observed between the D 2 estimated by equation (1) as a function of inbreeding coefficients and the average D 2 observed between adjacent SNPs on each the chromosome. A possible justification for the low correlation could be the relatively limited number of SNPs per chromosome on the panel used in the current study. The SNPs contained on the panel used herein covers only 299.6 Mb out of a total of 2,615.52 Mb, equivalent to 11% of the sheep genome. However, a few negative values were observed (e.g., −0.08) when estimating the correlation between D 2 estimated by F (equation (1)) and average D 2 between all pairwise SNPs on the chromosome. Additionally, equation (1) was derived under the assumption of finite and natural populations 7 .
The expectation of D at generation t can be derived from c (the recombination rate) and N e . This is given by 38 : A negative correlation between D, which is the numerator of |D′|, and both r 2 and effective size (N e ) is expected. Considering N e as an indicator of selection, lower N e values are a result of high selection pressure, and consequently a reduction in the number of breeding animals and genetic diversity. A negative relationship between average LD between all pairwise SNPs on a chromosome and N e was observed (−0.16), as expected. However, the correlation between average LD between adjacent SNPs and N e was positive (0.35). One potential reason for the observed discrepancy is the fact that N e was estimated based on the LD between all pairwise SNPs rather than LD between adjacent SNPs. For instance, Lindblad-Toh et al. also observed that the effective population size and the inbreeding coefficient were reduced during dog domestication, resulting in a decrease of LD 39 .
Haplotype blocks. The construction of haplotypes with only two (frequency = 1,879) to twenty-one (frequency = 1) markers was consistent with the low LD among pairwise SNP reported in this study. The mean size of haplotype blocks and the frequency of the number of SNPs for each chromosome are reported in Table 3. Short haplotype blocks in common among breeds have been observed by others 17 . The average distance (SD) between markers that formed the haplotype blocks was 0.04 (0.033) Mb. Considering the size of the sheep genome and the average distance between SNP that formed the haplotype blocks, it was possible to indirectly infer the minimum number of markers needed for genomic analyses, which was 61,415 SNPs. However, due to the high standard deviation of the distance between markers that formed the haplotype, it is important to use this number with caution.

Conclusions
The extent of LD among adjacent markers for the Santa Inês breed resembled those of previously reported results in other breeds of domesticated sheep. The mean LD values between all SNP pairs on each chromosome were consistent with domestic and wild sheep (Ovis canadensis and Ovis dalli) and they were lower than the estimates reported in other species. The findings reported in this study will be useful to provide a theoretical reference in determining the number of markers needed for future GS and GWAS in Santa Inês sheep.

Methods
Animal resources, genotyping and quality control. All experimental procedures employed in the present study that relate to animal experimentation were performed in accordance with the resolution number 07/2016 approved by Institutional Animal Care and Use Committee Guidelines from the School of Veterinary Medicine of University Federal of Bahia -UFBA and sanctioned by the president Prof. Claudio de Oliveira Romão to ensure compliance with international guidelines for animal welfare.
The dataset included the genotypes of 396 animals from the Santa Inês sheep breed collected between 2016 and 2017. These animals were fed in confinement for 54 to 92 days on average, during four different periods with slightly different nutritional management. This herd is located at the Experimental Farm of São Gonçalo dos Campos, the city of São Gonçalo dos Campos, Bahia, Brazil, and it is associated with the Federal University of Bahia (UFBA).
To characterize the Santa Inês sheep population, the relationship between animals was estimated using a genomic relationship matrix, G, as described in VanRaden (2008) 40 . The G matrix was constructed by using the PREGSF90 software in the BLUPF90 package [41][42][43] . The average relationship between animals (SD) was 0.001 (0.0634), with minimum and maximum values equal to −0.135 and 0.934, respectively. The hierarchically clustered heatmap of the G matrix was constructed using the gplots R package 44 and is presented in Fig. 3. The heatmap represents the relationship among individuals, with darker shades (red) representing low relationship between animals and lighter tones (light yellow) representing a high degree of relationship. The blocks observed in the heatmap represent individuals with stronger degrees of relationship than the overall mean relationship. By analyzing each block, we observed an overall relationship mean (standard deviation) within all blocks equal to 0.004 (0.0606), varying from −0.023 (0.0291) to 0.079 (0.1514). Random blocks with darker tones within the SCiENtiFiC RePORTS | (2018) 8:8851 | DOI:10.1038/s41598-018-27259-7 Fig. 3, for example, showed a lower mean (standard deviation) degree of relationship, with value equal to 0.001 (0.0555). None of the blocks can be considered as an exclusively full-sib or half-sib group 45 , although they include full-sib and half-sib relationships. Inside the most defined diagonal block, for example, 13 full-sib animal pairs and 350 half-sib animal pairs are represented. In the population as a whole, there are one twin animal pair, 38 full-sib animal pairs and 3,089 half-sib animal pairs. The structure of this population can be observed by a distribution printed into the left of Fig. 3, which presents the frequency of pairs by relationship degree. The major density of animal pairs is near zero, representing the overall low relationship among them. It is also possible to observe higher density of animal pairs above zero, closely to 0.25, 0.5 and 1.0, representing the half-sibs, full-sibs and twins as well as a mass lower than zero. The genetic structure of sampling might influence the LD results. For instance, a population with an elevated level of relationship probably will also have a higher level of inbreeding and, consequently, a higher LD level. Therefore, the complex breeding history of Santa Inês may have influenced the estimates of LD.
DNA was extracted from tissue samples of the Longissimus dorsi muscle collected from the left hemi-carcass and stored in 2.0 milliliter (ml) Eppendorf tubes. DNA extraction was performed according to protocols for lysis buffer and RNase. A high-density SNP panel (Illumina High-Density Ovine SNP BeadChip ® ) containing 54,241 SNP was used for genotyping. Chromosomal coordinates for each SNP were obtained from the ovine genome sequence assembly, Oar_v3.1.
Quality control (QC) of the genomic data was performed by the GenABEL R package 46 for LD analyses 47 . The PREGSF90 interface of the BLUPF90 program [41][42][43] was used to edit the genomic data for F, N e , MAF, and haplotype analyses. SNPs with a call rate lower than 0.90, MAF lower than 0.05 and p-value lower than 0.1 for the Hardy-Weinberg Equilibrium Chi-square test were excluded. One sample with a call rate lower than 0.9 was also removed. Table 4 summarizes the number of SNPs per chromosome before and after QC. We considered only the autosomal chromosomes (OAR1 to OAR26) in this study resulting in 38,168 SNPs retained for further analysis.
Inbreeding coefficient and effective population size. Inbreeding coefficient (F) was calculated as a function of the expected and observed homozygote difference by using the PLINK software 48 . This is given by  where F i is the estimated inbreeding coefficient of the i ih animal; O i is the number of homozygous loci observed in the i ih animal, E i is the number of homozygous loci expected and L i is the number of genotyped autosomal loci 48 .
Effective population size (N e ) was obtained by the SNeP software 49 . This software provides a history of the effective population size, that is, the number of past generations based on the relationship between N e , linkage disequilibrium represented by r 2 , and recombination rate (c) by using the following equation 50 Therefore, by solving equation (4), we have: where N e t ( ) is the effective population size at generation t, which is − f c (4 ( )) t 151 ; c t is the recombination rate in generation t which is proportional to the physical distance between markers, r 2 is LD, and α is the adjustment for mutation rate. The parameter α can assume three different values: 1, 2 or . 2 2 52 . When we consider α equal to 1, N c e tends towards 0 and we assume that there is no selection or mutation. On the other hand, when mutation does occur, the parameter α can be equal to 2 or 2.2. The value of 2.2 comes from the result of the equilibrium expression that was equal to 5 11 . In this expression, ρ A is the probability of allele A at marker (or SNP) 1, ρ B is the probability of allele B at marker (or SNP) 2, and ρ AB is a probability of the pair of AB markers; following Ohta & Kimura 52 . Tenesa et al. proposed α equal to two 53 . In our study, the N e by chromosome was the result of a harmonic mean due to a relatively small number of SNPs in each chromosome. The physical distance was transformed to genetic distance considering one Mb as one centimorgan (cM). Linkage disequilibrium analysis. The estimation of LD was performed in two ways for each chromosome: (1) between neighboring pairs of SNPs (adjacent SNPs) and (2) pairwise combination of all SNPs (pairwise SNPs) using the function LD in the R package genetics 47,54 . The |D′| is a scale of the frequency difference of the allele pairs AB, where A is the allele of the marker (or SNP) 1, and B the allele of the marker 2, and the expected frequency of each allele separately. |D′| parameter ranges from 0 to 1 and it is given by 55 Here ρ A is the probability of allele A at marker 1, ρ a is the probability of allele a at marker 1, ρ B is the probability of allele B at marker 2, ρ b is the probability of allele b at marker 2, and ρ AB is a probability of the pair of AB markers. Maximum likelihood was used to estimate ρ AB because genotype AB/ab is not distinguishable from genotype aB/Ab 56 .
The squared correlation between the markers, given by r 2 , is expressed as 7 : Haplotype blocks. The haplotype blocks were identified by following the approach suggested by Gabriel et al. 57 which was implemented via PLINK 48 . Blocks were partitioned according to whether the upper and lower confidence limits on estimates of pairwise |D′| measure fall within certain threshold values. The desired SNP panel density was estimated by the ratio of the megabase pair over the entire ovine genome and distance between markers that composed the haplotype blocks.
Data availability. Data are available on request.
Declarations. All experimental procedures involving sheep were approved by the Institutional Animal Care and Use Committee Guidelines from School of Veterinary Medicine of University Federal of Bahia -UFBA and sanctioned by the president Prof. Claudio de Oliveira Romão (n° 07/2016). All experiments were performed in accordance with relevant guidelines and regulations.