Introduction

The transcriptome-wide association study (TWAS) is an emerging gene-based association with phenotype that leverages the fine mapping of expression quantitative trait loci (eQTLs) and identification of causal genes with higher power than conventional genome-wide association studies (GWASes). TWAS uses the predicted gene expression levels as predictor variables affecting the phenotype variance unlike the GWAS that uses genotypes as the causal variables1. In the era of multi-“Omics”, integration of genome-wide SNP genotypes and transcriptome information to achieve fine mapping of causal variants for complex traits has been an essential step. Though its importance is evident, yet fewer studies have been conducted for complex economic traits in dairy animals. Majority of such studies focusses on identification of eQTLs based on GWAS2,3 and4 and post-GWA study on differentially expressed genes (DEGs) and weighted gene co-expression network analysis (WGCNA)5. However, the majority of GWAS-hit loci lie in non-coding regions6 and even though they might play a role in gene expression regulation, its physiological perspective is unclear. In dairy animals, most of the variants contributing to complex lactation traits have not yet been identified due to a limit on detection of true positives through GWAS in small sample sizes. Since the conception of TWAS7, several studies have been conducted in humans for psoriasis8, Depression9, hematological traits10, and Alzheimer’s disease11 etc. Apart from that, TWAS has been performed in maize12, and in pigs for meat quality traits13.

Several studies in humans indicate the superiority of TWAS power over GWAS when expression heritability (he2) varies between 0.04 and 0.2 and proportion of variance explained by causal cis-SNPs is low1,14. The advantage of TWAS lies in its statistical power in identifying a causal gene with much lesser sample size than GWAS and its robustness to incorporate both individual and summary statistics of GWASes15. Once the gene expression prediction model is developed, it can fit across studies and tissues that further increase the prediction accuracy. Riverine buffalo (Bubalus bubalis) being a major dairy animal with ~ 45% contribution to the national milk production in India, still remains aloof from the genomic research on its genetic architecture4. Integrating transcriptome and genome-wide SNP information shall help in delineating the causal genes for lactation traits in one of the major dairy breed of buffaloes i.e., Murrah. The present study aims at devising a suitable gene expression prediction model based on SNP genotypes and to associate the predicted expression levels with various lactation traits such as 305 days milk yield, peak yield, wet average and milk composition traits like fat and SNF% in Murrah buffalo. The scope of this study is identification of causal genes and cis-SNPs with higher effect sizes on these causal genes for the studied traits such that they can further be used in genomic selection and improvement programs.

Materials and methods

Lactation records of 144 randomly selected female Murrah buffaloes from Livestock Research Centre (LRC), National Dairy Resaerch Insitute (NDRI), Karnal, India (29.68°N and 76.99°E) were collected for the present study. 1st lactation records of 305 days milk yield (305 DMY), peak yield (PY), wet average (WA), fat percentage (fat%), Solid-not-fat percentage (SNF%), birth weight (bwt) in kg, and age at first calving (AFC) in months for 144 selected buffaloes which had completed their first lactation with a standard lactation length of 305 days or more were recorded. Generally, the animals are stall-fed and as let-down ration, 0.25 kg of additional concentrate is given at the time of milking. Green fodder and other roughages are provided in ad-libitum. All the buffaloes are exclusively stall fed in open paddocks at the LRC, NDRI.

The animal study was reviewed and approved by the ICAR-National Dairy Research Institute (IAEC). All experiments were performed in accordance with the guidelines and regulations of IAEC, ICAR-NDRI.

Generation of genotype information

Blood sample from 144 randomly selected animals were collected aseptically and DNA was isolated via Phenol–Chloroform method following protocol of Sambrook and Russell16. Quality of DNA was checked using agarose gel electrophoresis and quantity was assessed using Qubit 4.0 fluorometer. DNA double digestion with SphI and MluCI restriction enzymes was carried out for standard restriction-associated DNA (RAD) protocol as described by17. Standard Illumina read multiplexing protocol was followed with adapters (P1 and P2). After adapter ligation and size selection, samples were sequenced on Illumina Hi-seq 2000 platform and 150 bp paired end reads were generated with ~ 1X coverage. Index and sequence dictionary files for reference genome retrieved from NCBI website were created using the Burrows–Wheeler algorithm (BWA)18 and PicardTools, respectively. The quality of paired-end raw FASTQ files generated after sequencing, was checked using FastQC19. Adapters were marked and trimmed using bbmap20. The BWA-MEM algorithm was used to align the trimmed FASTQ sequences with the reference genome. Aligned files were coordinate-sorted, and duplicate reads were removed. Read group identifiers were updated using PicardTools. The quality of aligned BAM files was checked using qualimap21. Variants were called using bcftools-mpileup22. This variant calling pipeline was previously standardized in our laboratory23 and two sets of variant calling were performed. Set-I variants were called based on the latest Murrah buffalo reference genome GCF_019923935.1_NDDB_SH_1_genomic.fna (https://www.ncbi.nlm.nih.gov/assembly/GCF_019923935.1) and variants were retained for further training of the dataset to predict eQTL weights and individual level transcriptome wide association study. A second set (Set-II) of variants were called based on the Mediterranean buffalo reference genomeGCF_003121395.1_ASM312139v1_genomic.fna(https://www.ncbi.nlm.nih.gov/assembly/GCF_003121395.1) for GWAS.

Quality control (QC) check of variants for downstream analysis

Set-I SNPs were further QC checked using PLINK v1.924 and all the indels were removed from further analysis. Only biallelic variant sites on autosomes and X chromosome having a genotype rate > 95% were retained. Variants passing the threshold of Hardy–Weinberg equilibrium test at p < 0.0001 and minor allele threshold of 0.01 were retained for TWAS.

Set-II SNPs were also QC checked using PLINK v1.9 with thresholds of genotyping rate > 95%, linkage disequilibrium (LD) in terms of r2 < 0.8, not deviating from HWE at p < 0.0001, and with MAF > 0.05. Only autosomal and X chromosomal SNPs were retained for GWAS.

Genome-wide association study (GWAS)

GWAS was conducted with a set of 39,019 QC passed Set-II SNPs, for the 1st lactation 305 days milk yield, peak yield, wet average, fat% and SNF% using the true phenotypes for all the 144 individuals that had completed the 1st lactation with a standard 305 days of lactation. Genome-wide identity-by-state (IBS) for all pairs of individuals was checked. Multidimensional scaling (MDS) based on SNP information was done to check for the presence of any population stratification and was corrected by incorporating the first two MDS components as covariates in the model for GWAS. Birth weight (bwt) and age at first calving in months (AFC) were also included as covariates in the model. A genome-wide scan for significant SNPs considering only additive effects was accomplished through a simple regression model using PLINK v1.9 as described by Marees et al.25, where residuals were assumed to be normally and independently distributed. A linear regression model was fitted for determining the association between SNPs and continuous traits26. The threshold for genome-wide significance was determined by correcting the P-values of the SNP association test with Bonferroni’s correction and was 1.28 \(\times\) 10–6. P-values of the top ten SNPs of each GWAS traits was corrected for Benjamini–Hochberg’s false discovery rate (FDR) at 5% levels27 using the “R” package fuzzySim v3.028 and is given in the supplementary document. The results were plotted as Manhattan plots and Q-Q plots using the “qqman” package of R.

Linear regression model used for GWAS:

$$y = \, \upbeta_{0} + x*\upbeta_{{1}} + {\text{C}}_{{1}} *\upbeta_{{2}} + {\text{C}}_{{2}} *\upbeta {3}_{{}} + {\text{AFC}}*\upbeta_{{4}} + {\text{ bwt}}*\upbeta_{{5}} + {\text{ e}}$$

where, y = 1st lactation 305 days milk yield, x = additive effect of SNPs, C1 = first component of MDS, C2 = second component of MDS, AFC = Age at first calving in months, bwt = birth weight of the animal (in kgs), β0 = intercept term, β1 = regression coefficient representing the strength of association between SNP x and trait y, β2 = regression coefficient of C1, β3 = regression coefficient of C2, β4 = regression coefficient of AFC, β5 = regression coefficient of bwt, and e = residuals or noise not explained by SNPs. Eldawy et al.29 reported significant effect of body weight at birth and AFC on the reproduction and production performances in dairy buffaloes. Hence, these two variables were considered as important covariates for the GWAS and TWAS models.

Generation of gene expression information

For integrating transcriptomic information to find the underlying gene significantly contributing to the expression of complex lactational traits, 8 animals in the 2nd parity in a mid-lactation stage in the winter season were selected as reference animals for the TWAS having both genotype and phenotype data. The animals from the institute herd that had calved during the autumn and reached the peak lactation stage during the winter were selected as reference animals. To maintain the homogeneity, milk samples were collected from the animals in the mid lactation where animals attend their peak during the beginning weeks of this stage, which comes during the winter in the present study. The 1st lactation average in the herd was 2122.5 ± 286 kg. The reference animals were divided into two groups; above average + 1σ were considered high yielders (N = 5; > 2400 kg/lactation) while animals below average-1σ were treated as low yielders (N = 3; < 1800 kg/lactation) in the present study. Approximately, 150–200 ml of milk was collected aseptically in Diethyl pyrocarbonate (DEPC) treated tubes. RNA isolation was performed under sterile conditions in lab. RNA isolation was done following a hybrid protocol30 from the fat layer of the milk. Extracted RNA quantity was checked on Qubit 4.0 fluorometer and library was prepared for good quality samples with high RIN (> 6.5) values. Sequencing was performed using Illumina Novoseq 6000 platform. RNAseq data analysis was performed following the standard Galaxy workflow31. Adapters were trimmed using cutadapt v3.7 allowing a maximum error rate of 0.1. Trimmed RNAseq fastq files were aligned to the reference genome GCF_019923935.1_NDDB_SH_1_genomic.fna using BWA-MEM algorithm. Aligned BAM files were sorted by chromosomal coordinates and other post-alignment cleaning processes such as deduplication, and sample information update were completed using picardtools. Qualimap–RNAseqQC and BAMQC v2.2.2-dev were used for checking the quality of aligned BAM files. Feature counts were generated using featureCounts v2.0.1. assuming reads are forward stranded. Fragments were counted for the paired-end data only if both the reads were aligned after removing chimeric fragments. rLog normalized gene expression levels from DESeq2 v2.11.40.7 were obtained after correcting for the “production level” i.e., high and low yields and “batch of sample collection”.

Transcriptome-wide association study

To perform a two stage TWAS, first gene expression imputation model was designed for estimating the cis-eQTL effect sizes from a training sample (N = 8) for which both genotype and transcriptome data are available. The model suggested by Nagpal et al.14 was employed which is as follows:

$${\text{E}}_{{\text{g}}} = X_{{{\text{train}}}} {\text{w }} + \varepsilon , \, \varepsilon \, \sim N\left( {0, \, \upsigma_{\varepsilon }^{{2}} I} \right)$$
(1)

where, Eg: denotes the log normalized gene expression levels (after corrections for confounding factors such as production levels and sampling batch) for gene g. Xtrain: denotes the genotype matrix for all cis-genotypes (encoded as the number of minor alleles present 1 MB of the gene; [− 1 MB—Gene_start—Gene_end— + 1 MB]). w: denotes the corresponding cis-eQTL effect-size vector, and ε: denotes the error term.

The gene expression levels GReX (genetically regulated gene expression) of the test samples (N = 136) were imputed with the assumption of following model:

$$G\hat{R}eX = X_{{{\text{test}}}} \hat{w}$$
(2)

Given the predicted eQTL effect size estimates \(\hat{w}\) from the training data in Eq. (1), GReX was imputed by the Eq. (2) where Xtest is the genotype matrix containing cis-SNP data for the test dataset.

For training and prediction of the \(G\hat{R}eX\), both non-parametric Bayesian DPR method and parametric Elastic-Net model were used each with 5X cross validation (CV) and without any CV. Training, prediction, and association with phenotypes was accomplished using TIGAR: An Improved Bayesian Tool for Transcriptomic Data Imputation Enhances Gene Mapping of Complex Traits. ~ 15 simulations were run to test the appropriate parameters for the training model. Training was done with fivefold cross validation and without cross validation (leaving two out of 8 samples rotationally per iteration and training with all 8 samples) for both DPR and Elastic-Net model. An overall TWAS workflow is presented in Fig. 1.

Figure 1
figure 1

A brief workflow of the present study depicting steps from data acquisition to final genome-wide and transcriptome-wide associations.

Only the additive genetic effects of the cis-SNPs on genes were estimated as non-additive such as dominance and interaction effects tend to be overestimated in the small training samples, and estimation of only additive effects provide better prediction accuracy. SNPs were excluded if missing rate exceeded 0.2. Those SNPs having a MAF < 0.01 and deviating from the Hardy–Weinberg Equilibrium at p < 0.0001 were also excluded from the training. For training and individual level associations 1,64,830 QC passed Set-I SNPs were used. Association was performed based on the following model.

$${\text{f}}\left( {{\text{E}}\left[ {{\text{Y}}\left| {{\text{X}},{\text{C}}} \right.} \right]} \right) = \upeta {\text{C}} + \upbeta G\hat{R}eX$$
(3)

where, f(.) is a pre-specified link function, which is set as identity function for the quantitative phenotype, [Y|X,C]: Phenotype given genotype matrix Xtest and covariate matrix C, H0: β = 0 in Eq. (3).

Same covariates as that of GWAS i.e., AFC, Birth weight, and 1st two components of MDS of set-II variants were taken to maintain homogeneity. TWAS was performed for 305 DMY, PY, WA, fat% and SNF% in the test individuals (N = 136). A detailed methodology is provided in the supplementary file for methods.

Comparison of GWAS and TWAS results

Chromosome wise TWAS results for each trait and each model were combined to generate TWAS Manhattan plots. Manhattan plots generated from GWAS for each trait were compared with the TWAS Manhattan plots. Based on the significant genes and peak signals from the TWAS results via the DPR method, important genes were identified for the studied lactational traits. A TWAS hit gene’s midpoint position ± 1.5 Mb stretch was checked for presence of other potential TWAS hits those couldn’t be detected directly from the TWAS. Genes that are lying within that stretch of 3 Mb was considered for pathway enrichment. The pathways enriched with p < 5 × 10–2 were considered to be significantly enriched for the respective traits and the genes involved in those pathways were selected as probable candidate genes through the online gene ontology analysis platform gProfiler. Genes near the ± 20 kb of the GWAS hit SNPs and TWAS hit genes were compared for any shared genes among all methods. SNPs having highest positive and negative weights on prediction of those genes were recommended as important markers for further studies.

Ethics statement

The animal study was reviewed and approved by the ICAR-National Dairy Research Institute (IAEC). All experiments were performed in accordance with relevant guidelines and regulations.

Results

Genotyping by sequencing and total mRNA sequencing

An average of 2.31 million each of forward and reverse reads of 151 base pairs (bp) were obtained per sample after the ddRAD sequencing. The average GC content of the reads was 51.20% with 35–40 Phred score (Q). Raw reads were aligned to the latest reference genome GCF_019923935.1_NDDB_SH_1_genomic.fna with 97.59% mapping rate and 52.34 average mapping quality. Clean BAM files obtained after sorting, trimming, and duplicate removal were used for variant calling and 57,92,182 polymorphic sites containing 55,60,412 SNPs and 2,31,770 indels were obtained. A total of 1,64,830 SNPs (Set-I) those passed QC were finally retained for GReX prediction for TWAS. For GWAS, variants (Set-II) were called using Mediterranean reference buffalo genome as the variants called using latest reference genome showed high genomic inflation when used for GWAS. The reads were mapped to the Mediterranean reference genome with 95.45% mapping rate and 32.73 average mapping quality. A total of 5,804,693 variants were obtained among which 5,544,733 were SNPs and 2,59,960 were indels. After applying quality control threshold for GWAS (Table S1), a total of 39,019 SNPs were retained for the GWAS.

An average of 30.68 GB of raw data per sample in paired end mRNA sequencing was obtained. The average genome coverage was 2.59X with 87.06% mapping rate to the reference genome (GCF_019923935.1_NDDB_SH_1). After quality control, number of reads counted as fragments for the paired end data and an average of 38.73% of reads were mapped to exonic region (Table 1). Total 33,347 features were counted as genes and their expression values were obtained.

Table 1 RNAseq feature count and assignment details obtained after counting the fragments via featureCounts v2.0.1.

Genome-wide association study (GWAS)

GWAS was performed for 1st lactation 305 days milk yield, peak yield, wet average, fat% and SNF% taking AFC, birth weight, and 1st two MDS components as covariates. The association results for each trait are depicted through Manhattan plot along with the association results of TWAS for that particular trait. The Q-Q plot signifying the distribution of p-values with respect to the test hypothesis is presented in Fig. S1. The top ten SNPs with respect to highest –logP values and the genes in their vicinity of ± 20 kb are presented in the Table 2.

Table 2 Genes annotated ± 20 kb of the top ten SNPs having lowest p-values in GWAS for each trait.

Transcriptome-wide association study (TWAS)

Model training and gene-expression prediction

Set-I SNPs (1,64,830) were used to predict 26,956 genes which were further used to perform TWAS. Training accuracy was higher in case of ENET model through both with- and without cross-validation with an average of 0.67 while through DPR model training accuracy was 0.49 without cross-validation and 0.48 with cross-validation. λ value of was found to be 0.32 on an average across the chromosomes in the ENET model for both the methods indicating that regression coefficients were moderately shrinked with α when assumed to be 0.5 and with α determined by cross-validation. Chromosome wise training accuracy across all models and methods is presented in Table 3. The cis-QTL weights for the SNPs predicted through the DPR model with and without cross-validation were same; hence, the prediction of gene expression levels and their association with the respective traits was done only for the DPR model without cross-validation method. While separate association results were obtained for ENET with and without cross-validation methods.

Table 3 Chromosome-wise DPR and ENET model performances in gene-expression training and gene prediction.

Transcriptome-wide association results

The TWAS results are presented as a Manhattan plot for the different models along with GWAS results in Figs. 2, 3, 4, 5, 6 for 305 days milk yield, peak yield, wet average, fat%, and SNF%, respectively. The top 10 genes having the lowest P-value along with their FDR corrected P-values were identified and are given in the supplementary document (Tables S2S6).

Figure 2
figure 2

Manhattan plot showing the results of associations with the 305 DMY (A) GWAS (B) TWAS by ENET without CV (C) TWAS by ENET with 5X CV, and (D) TWAS by DPR without CV models. The red line indicates genome-wide p-value threshold (expressed as –log10P) corresponding to Bonferroni corrected p-values, above which the SNPs are considered to be significantly associated with the trait in GWAS, while the blue line indicates genome-wide significant threshold of Bonferroni corrected p-values for TWAS models. The red ovals surrounding various genomic region suggest the significant SNP/genes and peak association signals.

Figure 3
figure 3

Manhattan plot showing the results of associations with the PY (A) GWAS (B) TWAS by ENET without CV (C) TWAS by ENET with 5X CV, and (D) TWAS by DPR without CV models. The red line indicates genome-wide p-value threshold (expressed as –log10P) corresponding to Bonferroni corrected p-values, above which the SNPs are considered to be significantly associated with the trait in GWAS, while the blue line indicates genome-wide significant threshold of Bonferroni corrected p-values for TWAS models. The red ovals surrounding various genomic region suggest the significant SNP/genes and peak association signals.

Figure 4
figure 4

Manhattan plot showing the results of associations with the WA (A) GWAS (B) TWAS by ENET without CV (C) TWAS by ENET with 5X CV, and (D) TWAS by DPR without CV models. The red line indicates genome-wide p-value threshold (expressed as –log10P) corresponding to Bonferroni corrected p-values, above which the SNPs are considered to be significantly associated with the trait in GWAS, while the blue line indicates genome-wide significant threshold of Bonferroni corrected p-values for TWAS models. The red ovals surrounding various genomic region suggest the significant SNP/genes and peak association signals.

Figure 5
figure 5

Manhattan plot showing the results of associations with the Fat% (A) GWAS (B) TWAS by ENET without CV (C) TWAS by ENET with 5X CV, and (D) TWAS by DPR without CV models. The red line indicates genome-wide p-value threshold (expressed as –log10P) corresponding to Bonferroni corrected p-values, above which the SNPs are considered to be significantly associated with the trait in GWAS, while the blue line indicates genome-wide significant threshold of Bonferroni corrected p-values for TWAS models. The red ovals surrounding various genomic region suggest the significant SNP/genes and peak association signals.

Figure 6
figure 6

Manhattan plot showing the results of associations with the SNF% (A) GWAS (B) TWAS by ENET without CV (C) TWAS by ENET with 5X CV, and (D) TWAS by DPR without CV models. The red line indicates genome-wide p-value threshold (expressed as –log10P) corresponding to Bonferroni corrected p-values, above which the SNPs are considered to be significantly associated with the trait in GWAS, while the blue line indicates genome-wide significant threshold of Bonferroni corrected p-values for TWAS models. The red ovals surrounding various genomic region suggest the significant SNP/genes and peak association signals.

TWAS results for 305 days milk yield, peak yield, and wet average were checked to assess the role of important genes associated with milk production. From the corresponding Manhattan plots of TWAS for 305 days milk yield, peak signals of probable associations were observed on BBU10 at ~ 4 Mbp, BBU15 at ~ 37 Mbp and ~ 81 Mbp, and BBU6 at ~ 1 Mbp. Notably, the well-known candidate gene DGAT1 for milk yield is positioned at 81362831–81371652 bp on BBU15. Similar peak patterns were observed on the BBU15 at ~ 81 Mbp from the TWAS results for wet average. Apart from the TWAS results of the DPR model, TWAS for 305 DMY through Elastic Net model with 5X cross-validation revealed NDUFA11 on BBU9 at ~ 92 Mbp to be the top gene with lowest p-value. NDUFA11 was also found as the top gene with lowest p-value for the peak yield TWAS through Elastic Net model with cross-validation. From the peak yield TWAS through DPR model, peak signals were observed at BBU2 at ~ 17 Mbp, BBU20 at ~ 11 Mbp, BBU24 at ~ 38Mbp, and BBU9 at ~ 69Mbp. Though no significant associations were observed for milk production traits through TWAS, FDR corrected P-values in DPR model denoted significant association of TRNAW-CCA_36, LOC102411990, LOC102415512, LOC102416173, and LOC123335276 with the peak yield. LHFPL5 and RFTN2 were significantly associated with the SNF% in the ENET models.

For milk production, CREB3L3, GNA15, GNG7, MAP2K2, MAP2K7, SHC2, and ADCY9 were found to be involved in the Relaxin signaling pathway. TNFRSF11B, AMH, CCL25, CD70, EBI3, IL13, IL4, IL5, TNFSF14, TNFSF9, GDF9, LTA, LTB, TNF, and TNFRSF12A were significantly enriched in the cytokine-cytokine receptor interaction. IL13, IL4, IL5, MAP2K2, MAP2K7, VAV1, and TNF were significantly enriched in the Fc epsilon RI signaling (FcεRI) pathway. GNA11, MAP2K2, HCN2, KISS1R, and GABBR1 were enriched in the GnRH secretion pathway. CREB3L3, GNA11, MAP2K2, SHC2, and ADCY9 were enriched in the Growth hormone synthesis, secretion and action pathway. A novel uncharacterized gene LOC112578579 at BBU13 was found to be significantly associated with Fat% in the DPR without CV TWAS model. Along with that, peak signals were found on BBU13 at ~ 68 Mbp, and BBU6 at ~ 16 Mbp. For Fat%, AMPK signaling pathway was significantly highlighted by both KEGG and Wiki pathways. The genes found to be involved in the AMPK signaling were CCNA1, CAB39L, CREB3L4, CRTC2 and FOXO1. FOXO1 was also significantly enriched in the constitutive androstane receptor pathway and adipogenesis. EBPL gene was found to be significantly enriched for the biological process sterol metabolism.

Though DPR was considered as the most suitable model for our population, genes such as LHFPL5 and RFTN2 were found to be significantly associated with SNF% in the ENET 5X CV and ENET without CV models, respectively. Meanwhile, in the TWAS results of the DPR model, peak signals were observed on BBU2 at ~ 141.9 Mbp, BBU9 at ~ 100 Mbp, and on BBU17 at ~ 10 Mbp. AOX1, IL27RA, TYK2, and PTPN11 genes were significantly enriched in the JAK-STAT signaling pathway. Genes such as SDS and SDSL were enriched for the valine, leucine and isoleucine biosynthesis and cysteine and methionine metabolism pathway.

The novel genes identified for milk production, Fat%, and SNF% recommended for further studies are given in the Tables 4, 5, 6. The genes identified through the TWAS were then further checked for the SNPs having highest negative and positive weights for the prediction of that particular gene. The two top SNPs having the most negative and positive effect sizes for each TWAS highlighted genes were filtered from the gene prediction result files to obtain the important SNPs to be used as important markers in future selection programs (Tables S7S9).

Table 4 Chromosomal position and description of candidate genes identified for milk production.
Table 5 Chromosomal position and description of candidate genes identified for milk fat percentage.
Table 6 Chromosomal position and description of candidate genes identified for milk SNF percentage.

The genes identified through GWAS and TWAS hits were checked for common genes between them but no such common genes could be found (Fig. 7). Only ENET CV and without CV method showed 1, 4, and 2 genes for fat%, peak yield, and SNF%, respectively.

Figure 7
figure 7

Venn diagram showing number of shared genes between the top ten list of GWAS and TWAS. (A) 305 days milk yield, (B) Peak yield, (C) Wet average (D) Fat %, and (E) SNF % (DPR without CV, ENET with CV, and ENET without CV) (the darkest blue colour indicates no shared genes among the methods).

Discussion

This study is a first of its kind to be conducted in dairy animals to obtain transcriptome-wide association of predicted gene expression levels with economically important production traits of Murrah buffalo. Since the conception of TWAS as a post-GWAS prioritization tool in humans to identify and delineate the biological function of causal genes responsible for various diseases and quantitative phenotype viz. height, the field of animal science is lagging way behind in exploring the same in livestock species to augment genomic selection programs. Though GWAS have been very effective and one of the most widely used method to map several causal loci for complex traits; yet the biological gaps are still evident. The majority of GWAS-hit loci lie in non-coding regions and, even though they might play a role in gene expression regulation, its physiological perspective is unclear. In dairy animals, most of the variants contributing to complex lactation traits have not yet been identified, as their effect sizes are too small to be detected at current GWAS sample sizes. Hence, the need to perform TWAS like post-GWAS studies in animal species couldn’t be undermined any longer and the present study is an application of the concept in one of the major dairy animal species of India i.e., Murrah buffalo.

Genotyping by sequencing (GBS) using double digestion RAD tag technology have already been proven to be efficacious in terms of cost, speed of genotyping, robustness and is generalized to any species32,33. GBS aids in obtaining a large number of genome-wide SNP information for exploring within-species diversity, constructing haplotype maps and performing genome-wide association studies (GWAS)34, at a lesser expense than other contemporary methods. The ddRAD genotype information generated in the present study provided 1X genome coverage, and upon alignment to the reference genome 97.59% mapping rate was observed proving its suitability to be used in the GWAS and TWAS studies in the Murrah buffalo.

The transcriptome information for the present study was obtained by sequencing total mRNAs from mammary epithelial cells of lactating animals. Milk is a heterogeneous source of somatic cells composed of lymphocytes, neutrophils, macrophages and exfoliated epithelial cells35. The expression of genes involved in cell turnover, milk synthesis, or hormonal regulation in the mammary tissue is a key determining factor for milk production in ruminants. The applicability of the milk-isolated MECs to analyze mammary gene expression has been substantiated through many studies, as the gene transcript variations were also in accordance with milk yield and composition variations. Mammary epithelial cells (MEC) are unique in the way they are involved in the synthesis and secretion of milk, despite popularity of milk somatic cells in analysis of the gene expression for milk synthesis in ruminants36. Milk isolated mammary epithelial cells produce similar transcript variation profile that is consistent with variations in milk yield and compositions37. The genome-wide gene expression levels obtained from the MECs are well representation of the genome-wide lactation specific genes in the present study.

As in one of our previous studies, we highlighted the reasons for taking a GWAS sample size in hundreds as optimum to find significant results considering the buffalo farming scenario in Indian sub-continent23. Majority of organized buffalo farms in India have a herd size of 250–500 and only 100–200 breedable population, along with an absence of any functional buffalo sequencing consortium in India, renders sequencing of few hundreds of animals with sufficient genetic diversity to be feasible. In a such a case, when we considered genotyping 144 unrelated Murrah buffaloes, proportionately, for performing TWAS, total mRNA sequencing of 8 individuals were considered optimum, and the results obtained through TWAS evidently showed the robustness of the method and applicability in Murrah population despite the low sample size.

The gene expression prediction models i.e., Elastic-Net and DPR with 8 samples were trained. Nagpal et al.14 had discussed the higher power of TWAS using DPR over TWAS using Elastic-Net model. They have shown through a series of simulation along with real data study with ROS/MAP data that indicated the superior performance of DPR model over Elastic-Net model implemented in the PrediXcan, in terms of TWAS power and imputation R2. The advantage of DPR model lies in its flexibility of non-parametric Bayesian modelling that predicts higher number of genes with a better imputation R2 at causal gene proportions \(\ge\) 0.01 and at he2 \(\le\) 0.2. As, the lactation traits are of polygenic nature and are controlled by many genes, proportions of causal cis-SNPs are expected to be much higher and there is no prior information regarding the effect size distribution of cis-eQTLs, the DPR method of gene expression prediction can be considered as a choice of model for such a case. This assumption was also consistent with the results obtained in the present study that showed higher number of predicted genes through both DPR 5X cross-validation and without cross-validation methods than Elastic-Net model. DPR without CV model predicted 12.16% more number of genes than ENET without CV model, while DPR 5X CV model predicted 6.59% higher genes than ENET 5X CV. DPR without CV model predicted 80.92% of the genes used for training. The average training R2 was ~ 0.48 in DPR model while it had significantly higher value of ~ 0.67 through ENET model. Higher training R2 than 0.5 may seem as a model overfitting due to small sample sizes, which was also observed by Parrish et al.38 in their TWAS study with 49 tissue types. As DPR model is reported to be generalized, flexible, robust, and accounts for better prediction performance across broad genetic architectures38,39, and also in the present study it predicted higher number of genes with a reliable training R2, we selected DPR as choice of gene expression prediction model for our Murrah population. However, the weights predicted for the DPR model through 5X CV and without CV were found to be same for both which may be due to the small sample size. Hence, only the DPR without CV model was included for further study along with both ENET models.

A final association study was conducted for different 1st lactation traits with gene expressions predicted using various models and the results were compared with GWAS results. For comparison purpose the phenotype association model was same for both the GWAS and TWAS. The TWAS results from DPR without CV, ENET 5X CV, and ENET without CV showed higher numbers of true positives than GWAS for all the traits. The P-values of associations were adjusted for multiple testing by Bonferroni’s correction and FDR-BH (False Discovery Rate by27. Significant associations were observed after adjusting for multiple testing for fat% and SNF% using TWAS while no such associations could be observed in GWAS. The FDR of the top 10 genes for different traits in various TWAS models shows the number of suggestive associations that could be truly positive but couldn’t pass the Bonferroni’s threshold possibly due to the small sample size.

The present study revealed Relaxin-signaling pathway as a regulator for milk yield. Previously, ADCY5 of the Adenylate Cyclase family was reported to be a candidate gene for regulating milk yield in buffaloes40, which also signifies the role of Adenylate Cyclase family genes in regulation of milk yield. Upon the network analysis, it was observed that ADCY5 is also a key regulator of the Relaxin-signaling pathway. Although, Relaxin is a well-known hormone secreted during pregnancy in some species to soften the cervix and prepare the reproductive tract for parturition, it was also reported to have major mammogenic role in sows41. CREB3L3 is also reported to act as a central regulator of energy homeostasis through AMP signaling pathway in dairy cows42. Ye et al.40 reported INHBA and INHBB to be involved in cytokine-cytokine interaction pathway and this pathway has been reported to be a significant regulator of milk yield. Ahlawat et al.43 reported that genes down-regulated in milk somatic cells of buffaloes were significantly enriched in cytokine receptor interaction pathway. Several other genes of cytokine receptor families were identified to be involved in heat tolerance in water buffaloes44.

In Nili-Ravi buffaloes, Prolactin (PRL) a major gene involved in mamogenesis, regulation of milk protein, and milk secretion is reported to be regulated by the cytokine-cytokine interaction pathway45. Ye et al.40 reported INHBA and INHBB as candidate genes in regulation of milk yield that were also significantly enriched in the cytokine-cytokine receptor interaction pathway in the present study. Several genes from the TWAS results along with previously reported PIK3R140 were significantly enriched in the Fc epsilon RI signaling (FcεRI) pathway. FcεRI is required for cell membrane expression and intracellular signal transduction. Milk production TWAS genes also found to be involved in GnRH secretion and growth hormone synthesis, secretion and action pathway. Several reports indicate that growth hormone and growth hormone receptor genes play a vital role in growth of mammary gland in lactating females and regulation of milk yield. The GHR gene is implicated in lipid and carbohydrate metabolism and maintaining lactation46. GHR polymorphism has been reported to be associated with milk yield in buffaloes47.

AMPK signaling pathway is previously reported to be involved in regulation of milk production48, milk fat and protein synthesis49. The AMP-activated protein kinase (AMPK) was also reported to control lipid and lactose synthesis in bovine mammary epithelial cells50. AMPK signaling pathway was also reported to be involved in modulation of milk yield in buffaloes with ELAVL1, RAB11B, ADIPOR2, ADRA1A, INSR, LEP, PIK3CA, SCD, and TSC1 genes as nodes51. FOXO1 is a member of fork-head family of transcription factors that plays a vital role in gluconeogenesis in the liver52. Jacometo et al.53 also suggested the role of FOXO1 in milk fat synthesis. FOXO1 was reported to be differentially expressed for milk fat traits in Chinese Holstein cattle54. Sterol metabolism reported to be a critical regulator of milk fat synthesis in dairy cows and several sterol regulatory element-binding proteins have been characterized as the candidate genes for the milk fat synthesis in mammary epithelial cells of dairy cows55.

Many studies have highlighted the role of JAK-STAT signaling pathway in mammary gland development and milk production. Khan et al.56 have reviewed several works highlighting the role of this pathway in milk casein gene regulation and interaction of lactogenic hormone receptors with JAK-STAT pathway to regulate milk proteins. Prolactin receptor is also reported to regulate few JAK-STAT-associated proteins that balances the growth hormone in relation to milk protein yield57. Ji et al.58 have also highlighted the role of STATs in regulating the 5′ flanking regions of Whey acidic protein (WAP) that is expressed in the mammary gland.

Methionine is the limiting amino acid for the ruminants and is essential for the milk protein synthesis whereas, valine, leucine and isoleucine are also essential amino acids that are reported to be potentially limiting for milk protein synthesis59. As the genes identified in the study play important role in pathways regulating milk yield either directly or indirectly, they can be considered as candidate genes for milk yield and its composition traits.

Conclusion:

In a dairy breeding program, the prior knowledge about the distribution of eQTL effect size is often not considered. Non-parametric Bayesian based method could be an excellent choice to predict the eQTL effects. DPR is a one such model of gene expression prediction, this model can accommodate across tissue information, which improves the prediction accuracy. Our study concludes that the TWAS in the Murrah buffaloes for lactation traits proved to be more robust and efficacious than conventional GWAS even when the sample size are not large. Reasons could be a higher statistical power associated with TWAS. We were able to map important causal genes and many true positive associations for almost all the traits even with a small sample size using TWAS approach.