Identification of loci associated with pathological outcomes in Holstein cattle infected with Mycobacterium avium subsp. paratuberculosis using whole-genome sequence data

Bovine paratuberculosis (PTB), caused by Mycobacterium avium subsp. paratuberculosis (MAP), is a chronic granulomatous enteritis that affects cattle worldwide. According to their severity and extension, PTB-associated histological lesions have been classified into the following groups; focal, multifocal, and diffuse. It is unknown whether these lesions represent sequential stages or divergent outcomes. In the current study, the associations between host genetic and pathology were explored by genotyping 813 Spanish Holstein cows with no visible lesions (N = 373) and with focal (N = 371), multifocal (N = 33), and diffuse (N = 33) lesions in gut tissues and regional lymph nodes. DNA from peripheral blood samples of these animals was genotyped with the bovine EuroG MD Bead Chip, and the corresponding genotypes were imputed to whole-genome sequencing (WGS) data using the 1000 Bull genomes reference population. A genome-wide association study (GWAS) was performed using the WGS data and the presence or absence of each type of histological lesion in a case–control approach. A total of 192 and 92 single nucleotide polymorphisms (SNPs) defining 13 and 9 distinct quantitative trait loci (QTLs) were highly-associated (P ≤ 5 × 10−7) with the multifocal (heritability = 0.075) and the diffuse (heritability = 0.189) lesions, respectively. No overlap was seen in the SNPs controlling these distinct pathological outcomes. The identified QTLs overlapped with some QTLs previously associated with PTB susceptibility, bovine tuberculosis susceptibility, clinical mastitis, somatic cell score, bovine respiratory disease susceptibility, tick resistance, IgG level, and length of productive life. Pathway analysis with candidate genes overlapping the identified QTLs revealed a significant enrichment of the keratinization pathway and cholesterol metabolism in the animals with multifocal and diffuse lesions, respectively. To test whether the enrichment of SNP variants in candidate genes involved in the cholesterol metabolism was associated with the diffuse lesions; the levels of total cholesterol were measured in plasma samples of cattle with focal, multifocal, or diffuse lesions or with no visible lesions. Our results showed reduced levels of plasma cholesterol in cattle with diffuse lesions. Taken together, our findings suggested that the variation in MAP-associated pathological outcomes might be, in part, genetically determined and indicative of distinct host responses.

Genes within this region such as the NFKB Inhibitor Epsilon (NFKBIE) control the response to several bacterial and viral pathogens as well as vaccines, possibly by influencing antibody production. The candidate genes associated with the multifocal lesions are novel in the sense that they have not been associated with PTB risk in cattle before except the Neuronal PAS domain protein 2 (NPAS2) a member of the basic helix-loop-helix (bHLH)-PAS family of transcription factors. NPAS2 is a core component of the circadian clock, an important regulator of a wide array of physiological functions including metabolism, sleep, body temperature, blood pressure, endocrine, immune, cardiovascular, and renal function. The GWAS defined 92 SNPs and 9 QTLs with a high association with the diffuse lesions located in 6 chromosomes (BTA1, BTA3, BTA7, BTA8, BTA13 and BTA23) ( Quantile-quantile plots and odds ratios. Quantile-quantile plots comparing the observed distribution of -log (P values) to the expected values under null hypothesis were generated. The plots showed a distribution close to the expected distribution for the following phenotypes: focal vs undetected lesions (λ median = 1.01689), multifocal vs undetected lesions (λ median = 1.04401), and diffuse versus undetected lesions (λ median = 1.03222), indicating that significant values were not overestimated due to population stratification or cryptic relatedness.   Table 2. QTLs surpassing the significance threshold (P < 5 × 10 −7 ) for evidence of an association with the multifocal lesions. 1 QTL location, 2 SNP location in the genome assembly, 3 Positional candidate genes are defined as genes that are located within 50 kb on either side of the identified QTL. www.nature.com/scientificreports/ A slight deviation in the upper right tails from the y = x line suggested that some associations were present in the data. The regression coefficients (b-values) for all the SNPs associated with the presence of multifocal and diffuse lesions were all positive suggesting a positive correlation between exposure and outcome (data not shown). In Table 3. Quantitative trait loci (QTL) surpassing the significance threshold (P < 5 × 10 −7 ) for evidence of an association with the diffuse lesions. 1 QTL location, 2 SNP location in the genome assembly, 3 Positional candidate genes are defined as genes that are located within 50 kb on either side of the identified QTL.  www.nature.com/scientificreports/ addition, the odds ratios (OR) calculated for all the SNPs associated with the presence of multifocal and diffuse lesions (P < 5 × 10 −7 ) were > 1 indicating that the animals with the minor alleles had a higher risk of developing multifocal or diffuse lesions (Supplementary Table 3).
Gene ontologies and metabolic pathways. Functional categorization of the candidate genes for each phenotype (presence or absence of multifocal and diffuse lesions) was performed using the Bioconductor GOseq package. We identified 7 GO and 1 metabolic pathway significantly enriched in the animals with multifocal and diffuse lesions, respectively. As showed in

Antikeratin antibodies (AKA) and cholesterol plasma levels.
To test whether a positive correlation between the levels of AKA in plasma and the presence of multifocal lesions existed, AKA levels were measured in plasma samples of cows with multifocal lesions and without visible lesions. No significant differences (P = 0.2) were found between the levels of AKA in cows with multifocal lesions (1.44 ng/ml) in comparison with cows with no visible lesions (2.10 ng/ml). As presented in Table 4, enrichment of candidate genes controlling the cholesterol metabolism (bta04979) was identified in cows with diffuse lesions. To test whether the enrichment of SNP variants in candidate genes caused dysregulation of the cholesterol metabolism, the levels of total cholesterol were measured in plasma samples from cattle with focal, multifocal or diffuse lesions or with no visible lesions. As seen in Fig. 2, there was a significant decrease in total cholesterol in plasma samples from cows with diffuse lesions (0.080 µg/µl) compared to those with focal (0.126 µg/µl; P ≤ 0.001), or multifocal (0.141 µg/µl, P ≤ 0.001) lesions, or with no visible lesions (0.129 µg/µl, P ≤ 0.001).

Discussion
PTB is a multifactorial disease that arises as the result of the interaction of genetic, environmental, and microbial factors leading to the various PTB outcomes. Genetic factors are likely to play an important role in PTB pathogenesis. Cattle infected with MAP display various types of lesions with distinct severity but the associations between host genetic and PTB-associated pathology had not been explored before. Although previous GWAS identified loci associated with MAP tissue infection assessed by PCR and culture 15,[28][29][30] , our study provides the first comparison of the genetic variants associated with three distinct PTB-associated lesions (focal, multifocal and diffuse) in Spanish Holstein cattle (N = 813). Our study is unique in the definition of cases and controls through the histopathological analysis of PTB-associated lesions. Using WGS data, the h 2 estimates were calculated for the presence or absence of PTB-associated histopathological lesions in Spanish Holstein cattle (N = 813). When contrasting these two PTB outcomes, we couldn´t identify any SNP surpassing the FDR < 0.05. However, the stratification of PTB-associated pathology in three categories allowed the identification of SNPs surpassing the FDR < 0.05 and increased the h 2 of the diffuse lesions (h 2 = 0.189) when compared with phenotypes such as the ELISA-tissue PCR-tissue culture (h 2 = 0.139) that typically detect animals with diffuse lesions 15 33 , the somatic cell score is included in selective breeding programs in many countries, including Spain. The GWAS did not identify any SNP associated with the focal lesions. This could be attributed to the lack of discrimination between no visible and focal lesions. Using RNA-Seq technology we previously identified a biomarker of PTB progression, the precursor of the bovine intelectin 2 (ITLN2), which was overexpressed in ICV samples of animals with focal (log 2 fold-change = 10.6) and diffuse (log 2 fold-change = 6.8) lesions compared with animals without visible lesions 34 . More recently, we have demonstrated that the quantification of bovine ITLN2 secreting cells by immunohistochemical analysis of ICV sections could constitute a good post-mortem tool, complementary to the histopathology, to improve the detection of focal lesions which may sometimes be difficult to detect 35 . To test whether the focal lesions were controlled by specific genetic variants, a GWAS comparing animals with focal versus multifocal and focal versus diffuse lesions was performed (data not shown). A total of 28 and 590 SNPs were specifically associated with each comparison (FDR < 0.05), respectively. These results revealed significant differences in the genetic variants associated with the presence of focal lesions when compared with the multifocal or diffuse lesions. Further GWAS using diagnostic methods allowing a better discrimination of the animals with focal and no visible lesions are needed.
A total of 192 and 92 SNPs defining 13 and 9 distinct quantitative trait loci (QTLs) were highly-associated with the multifocal and diffuse lesions, respectively. All these SNPs had a positive b-value and OR > 1 and were, therefore, associated with each pathological outcome. No overlap was seen in the SNPs associated with each type of lesion which suggested that distinct genetic variants might control the multifocal and diffuses lesions and that these lesions might represent divergent disease outcomes. Further blinded studies are required to validate the associations between the identified SNPs and corresponding pathological outcomes in an independent population. Although the functional effects of the identified candidate genes were assigned through GO and KEGG pathways, further functional studies need to be performed on selective SNPs to determine if they are affecting the positional candidate genes or other genes through regulatory effects. Pathway analysis with candidate genes overlapping the lesions-associated QTLs revealed a significant enrichment of variants controlling the intermediate filament cytoskeleton and cholesterol metabolism in the animals with multifocal and diffuse lesions, respectively. Our findings provide for the first time a potential link between genetic variants in KRT genes (KRT5, KRT7, KRT72, KRT73, KRT74, KRT75, KRT80, KRT81, and KRT83) and PTB. Interestingly, abnormal KRT mutations have been associated with IBD 36 . It was previously reported the sharing of mimicking B cell epitopes between M. leprae and the host KRT7 37 . Similarly, we hypothesized that molecular mimicry between putative epitopes of KRT and MAP might be playing a role in the development of PTB-associated multifocal lesions. However, no correlation between high levels of AKA in plasma and cows with multifocal lesions was observed. In the other hand, KRTs, the major subgroup among the intermediate filament family of cytoskeletal proteins, are responsible for maintaining the stability and integrity of the gastrointestinal epithelium, for providing tissue resilience against many stimuli, and regulating various cellular functions such as cellular proliferation, migration, differentiation as well as inflammatory and immune responses 38 . In a granulomatous experimental model using the www.nature.com/scientificreports/ teleost fish Piaractus mesopotamicus infected with Bacillus Calmette-Guérin (BCG), stronger immunostaining with anti-cytokeratin antibodies was observed at 33 days p. i. when the epithelioid cells were more evident and the granulomas were fully formed 39 . In tuberculoid (TT) and borderline tuberculoid leprosy, epithelioid noncaseinated granulomas encapsulated by many lymphocytes predominate and acid-fast bacilli are absent or only rarely present 40 . The presence of epithelioid granulomas with multifocal distribution in TT leprosy leads to the control of M. leprae replication and the containment of its spread 41 . Further immunohistochemical studies with a bovine anti-cytokeratin antibody are needed to quantify the number of KRT-stained cells in PTB-associated lesions and to reveal the potential role of KRT in maintaining tissue resilience in animals with multifocal lesions. While KRT variants were associated with cows with multifocal lesions, genetic variants in candidate genes involved in the cholesterol metabolism were enriched in animals with diffuse lesions, thereby suggesting that cholesterol variants associated with disease progression. To test whether the enrichment of SNP variants in candidate genes (NCEH1/LDLR/ANGPTL8/ANGPTL4) involved in the cholesterol metabolism was associated with the diffuse lesions, the levels of total cholesterol were measured in plasma samples of cattle with focal, multifocal or diffuse lesions or with no visible lesions. Our results showed reduced levels of plasma cholesterol in cattle with diffuse lesions (P ≤ 0.001). Similarly, inflammation has been associated with decreased total serum cholesterol levels in patients diagnosed with CD 42 and IBD 43 . This reduction might be due to impedance of absorption of cholesterol efficiently thought the thickened intestinal wall in individuals in advances stages of the infection. Using RNA-Seq, we previously observed that the cholesterol route (bta04977) was dysregulated in the ileocecal valve (ICV) of cows with diffuse lesions versus the control group with four upregulated genes matching this route (APOA1, APOC3, APOA4, APOB) 34 . In addition, the top upregulated gene in peripheral blood of animals with focal lesions versus control cows was the ATP-Binding Cassette Subfamily A Member 13 (ABCA13), a gene that accelerates cholesterol internalization and accumulation in intracellular vesicles 34,44,45 . In agreement with our data, recent evidence suggests that MAP-infected macrophages accumulate intracellular cholesterol droplets and depict a foam cell phenotype during infection providing an enriched environment for MAP survival 46,47 . Altogether, these findings suggest increased cholesterol transport, internalization and hydrolysis in macrophages of PTB-infected animals with diffuse lesions, which may invoke a localized compensatory mechanism to increase cholesterol synthesis at the site of the infection. It is well known that one of the main clinical signs associated to clinical PTB is decreased milk production. Animals with diffuse lesions have lower milk production and milk fat content when compared with animals with other lesions and uninfected cows 48 . If an animal shows decreasing plasma cholesterol levels that may be considered an indication that it is likely to have decreased milk production and progressing toward clinical disease.

Conclusions
While PTB-associated multifocal lesions lead to a localized disease, the diffuse lesions present a disseminated form with high bacterial loads. Our results suggested that the genetic variants associated with the presence of focal, multifocal and diffuse lesions were distinct. While keratin variants associated with the multifocal lesions, cholesterol variants associated with the diffuse, more severe lesions. It follows that there are at least two distinct disease outcomes, which might be indicative of different host responses. Rearrangements in the keratin filaments and cholesterol metabolism are predominantly gearing this disease outcome polarization.

Materials and methods
Ethics statement. Animals used in this study were not submitted to any in vivo experimentation before stunning for slaughter and, therefore, no specific ethics committee authorization was required. The cows were slaughtered in the Bilbao and Donostia municipal slaughterhouses (Basque Country, Spain) under the pertinent Basque Histopathological examination. Post-mortem tissue sampling was performed as previously described 49 .
Briefly, samples from ileocecal lymph node, jejunal lymph node, ICV, jejunum, and terminal ileum were collected aseptically from each animal and placed in formalin within 24 h after arrival at the laboratory. The samples were preferentially taken from areas of the preselected tissues that showed gross lesions, thickened mucosa and enlarged lymphatic nodes, if present. The samples were fixed in 10% neutral buffered formalin for 72 h, dehydrated through graded alcohols and xilol, embedded in paraffin, and cut into 4 μm sections. Sections were, mounted on treated microscope slides, stained with haematoxylin and eosin (HE) and with Ziehl-Neelsen (ZN) and examined for pathological lesions and for the presence of acid-fast bacteria, respectively. According to their location and extension, inflammatory cell type, and mycobacterial load, PTB-associated histopathological lesions were classified in focal, multifocal, and diffuse lesions as previously described 13 . Genotyping and imputation to WGS. Peripheral  Variance components and h 2 estimates. The variance components, standard errors (SE), and h 2 estimates explained by all the SNPs were calculated using the genome-wide complex trait analysis (GCTA) software 1.93.2, according to the following formula:h 2 = σ2 G σ2 G+σ2 e where σ G is the additive genetic effect of the individuals and σ e is the residual variance 53 . The variance components σ G and σ e in the equation were estimated by the genomic-relatedness-based restricted maximum-likelihood (GREML) approach implemented in GCTA1.93.2. The concept behind this method is to fit all the SNPs simultaneously as random effects in a mixed linear model to estimate the phenotypic variance explained by all the SNPs.
Genome-wide association study. Associations between the imputed genotypes and the presence or absence of focal, multifocal or diffuse lesions was analyzed in a case-control study using the mlma (mixed linear model) association analysis of the GCTA 1.93.2 53 . Briefly, the model is y = a + bx + g + e, where y is the phenotype, a is the mean term, b is the additive effect (fixed effect) of the candidate SNP to be tested for association, x is the SNP genotype indicator variable coded as 0, 1 or 2, g is the polygenic effect (random effect) i.e. the accumulated effect of all SNPs (as captured by the GRM calculated using all SNPs), and e is the residual. Control cows did not have visible lesions in gut tissues and had a negative ELISA, tissue PCR and culture result at the time of slaughter. Age was included as a covariate in the analysis. After the GWAS, the SNPs with R 2 values higher than 70% were retained. To account for multiple testing, a 5% chromosome-wise false discovery rate (FDR) was used to determine the probability that the associations were not false positives. P values between P = 5 × 10 −5 and P = 5 × 10 −7 provided a moderate significance level (α), and P values < 5 × 10 −7 were used to identify highly significant associations (Welcome Trust Case Control Consortium, 2007). The inflation factor (λ) and quantile-quantile plots were calculated to compare observed distributions of -log (P values) to the expected distribution under the no association model for each phenotype. λ value close to 1 suggests appropriate adjustment for potential substructure and λ > 1.2 suggests population stratification. The regression coefficients (b) calculated using GCTA 1.93.2 represent the estimated increase in the log odds of the outcome per unit increase in the value of the exposure. In other words, the exponential function of b (e b ) is the OR-associated with a one-unit increase in the exposure. In addition, the OR and their 95% confidence intervals (CI) for the SNPs associated with the presence or absence of each type of histopathological lesion (P < 5 × 10 −7 ) were calculated using logistic regression analysis with the WGassociation function of SNPassoc 1.9.2 under five different genetic models (co-dominant, dominant, recessive, over-dominant, and log-additive) 54 . For each SNP and genetic model, the function WGstats of SNPassoc 1.9.2. provides genotype frequencies, OR, and 95% CI with the major homozygous genotype deemed as the baseline. SNPs, QTLs, and candidate genes identification. The location of the significant SNPs was determined with biomaRt 2.44.1 for R 55 using the ARS-UCD1.2 reference genome. The genomic distribution of the identified SNPs was determined using the Ensembl Variant Effect Predictor (VEP). QTLs associated with each type of histological lesion were defined based on SNPs on linkage disequilibrium patterns with SNPs that surpassed the suggestive significance threshold (P < 5 × 10 −7 ) in a given chromosome. The beginning and end of each QTL were defined in a window of 500,000 base pairs upstream and downstream by the SNPs that were furthest upstream and downstream of the suggestive SNP. Overlapping QTL regions were merged and considered as a single QTL. The defined QTL regions we further investigated for the presence of candidate genes within 50,000 base pairs to each side of the defined QTL using Ensembl (https:// www. ensem bl. org). The identified QTLs and candidate genes were compared with the reported cattle QTLs and candidate genes for animal diseases including PTB, bovine tuberculosis, clinical mastitis and bovine respiratory disease (http:// www. anima lgeno me. org).
Gene ontology and metabolic analysis. Candidate genes within significant QTLs were investigated for significant enrichments of GO categories and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways using the cluster Profiler Bioconductor package 3.10.1 56,57 . Briefly, the ClusterProfiler package offers a gene classification method, namely groupGO, to classify genes based on their projection at a specific level of the GO corpus, and provides functions, enrichGO and enrichKEGG, to calculate enrichment test for GO terms and KEGG pathways based on hypergeometric distribution. To prevent high false discovery rate (FDR) in multiple testing, P adjust values were estimated. GO analysis provides categories of genes involved in different biological processes (BP), molecular functions (MF), and those integral for different cell compartments (CC). www.nature.com/scientificreports/ Samples were washed out twice with PBS and then bovine AKA antigen (100 µl) was added. Plates were incubated at 37 °C form 60 min. ELISA plates were washed 3 times and an avidin-peroxidase conjugate (100 µl) was added to each well. The plates were sealed and incubated at 37 °C for 30 min. The enzyme conjugate was washed out of the wells five times with PBS and a TBM substrate (100 µl) was used for coloration. TBM reacts to form a blue product from the peroxidase activity, and finally turns to yellow after addition of 100 µl per well of the stop solution (Color Reagent C). Plates were read at 450 nm in an ELISA reader (Thermo Scientific Multiskan, US).
The OD values of each sample and standard had the values of the blank subtracted. We averaged the duplicate OD readings for each standard and sample. A standard curve was generated by plotting the mean OD values of each standard on the vertical axis and the corresponding concentration on the horizontal axis. The AKA concentration levels in each sample were interpolated from the standard curve. Statistical analysis was performed using an unpaired t-test for comparison between two groups (GraphPad Prism 8, San Diego, California, US). Differences were considered significant when P ≤ 0.05. Each reaction requires 2 µl of plasma which was brought to a final volume of 50 µl with cholesterol buffer. An equal volume (50 µl) of total cholesterol reaction master mix, containing cholesterol esterase, was added to each well and placed on a horizontal shaker for 5 min, after which the plate was incubated in the dark a 37 °C for 1 h. Following incubation, absorbance was measured at 570 nm using an ELISA reader (Thermo Scientific Multiskan, US). Cholesterol concentration (µg/µl) present in the plasma samples was calculated by comparison against a serial dilution of cholesterol standards provided with the kit. All samples and standards were run in duplicate. Statistical analysis was performed using an unpaired t-test for comparison between groups. Differences were considered statistically significate when P ≤ 0.05.

Cholesterol quantitation. A Cholesterol
Preprint. This article was submitted to an online preprint archive 58 .

Data availability
The sequence variants for 1800 animals from Run8 are public at the European Variation Archive under project number PRJEB42783. The SNPs highly-associated with MAP-associated pathology have been deposited in the animal genome database (https:// www. anima lgeno me. org/). Individual genotype data used in this study is managed by a third party, the Spanish Friesian Cattle National Federation (CONAFE