Identification of functional features underlying heat stress response in Sprague–Dawley rats using mixed linear models

Since global temperature is expected to rise by 2 °C in 2050 heat stress may become the most severe environmental factor. In the study, we illustrate the application of mixed linear models for the analysis of whole transcriptome expression in livers and adrenal tissues of Sprague–Dawley rats obtained by a heat stress experiment. By applying those models, we considered four sources of variation in transcript expression, comprising transcripts (1), genes (2), Gene Ontology terms (3), and Reactome pathways (4) and focussed on accounting for the similarity within each source, which was expressed as a covariance matrix. Models based on transcripts or genes levels explained a larger proportion of log2 fold change than models fitting the functional components of Gene Ontology terms or Reactome pathways. In the liver, among the most significant genes were PNKD and TRIP12. In the adrenal tissue, one transcript of the SUCO gene was expressed more strongly in the control group than in the heat-stress group. PLEC had two transcripts, which were significantly overexpressed in the heat-stress group. PER3 was significant only on gene level. Moving to the functional scale, five Gene Ontologies and one Reactome pathway were significant in the liver. They can be grouped into ontologies related to DNA repair, histone ubiquitination, the regulation of embryonic development and cytoplasmic translation. Linear mixed models are valuable tools for the analysis of high-throughput biological data. Their main advantages are the possibility to incorporate information on covariance between observations and circumventing the problem of multiple testing.


Scientific Reports
| (2022) 12:7671 | https://doi.org/10.1038/s41598-022-11701-y www.nature.com/scientificreports/ and gene expression levels between the control and the heat-stress groups. In contrast to the vast majority of approaches, when particular transcripts are considered as independent, we incorporated biological information on the correlation between them directly into the statistical model. Furthermore, we also considered higherorder units, expressed by Gene Ontology terms and Reactome pathways, since these are the actual functional components of the physiological response towards heat stress. The incorporation of functional units into the analysis of RNA-seq data is a known idea. It has been applied either through enrichment analysis as e.g. by Dou et al. 14 to the data set analysed in our study or through the Ingenuity Pathway Analysis as e.g. by Lan et al. 15 to gene expression under heat stress in poultry. Nevertheless, the major difference between our approach and the aforementioned studies is that the former conducts a twostep analysis, by first identifying significantly differentially expressed transcripts and genes by considering them independent (step 1) and then further analysing the selected, significant transcripts and genes in terms of the correlation between them or the enrichment of ontologies or pathways in this significant gene set (step 2). The approach proposed in our study attempted to combine the process within one-step by including the correlation of the functional information directly into the model fitted to the whole (i.e. not only to significant) expression data and is statistically similar to the linear mixed model fitted to gene expression data by Wang et al. 16 , albeit with a different approach to the modelling of the covariance of random effects.
In the current study, we illustrate the application of the mixed models using the whole genome transcript expression data of Sprague-Dawley rats obtained from a heat stress experiment 14 .

Methods
Experimental animals. The data set underlying the analysis is a subset of the material used by Dou et al. 14 .
In brief, the analysed individuals comprised eight weeks old female, specific-pathogen-free, Sprague-Dawley rats. Prior to the experiment, all rats were housed in a laboratory at 22 ± 1 °C, 50% relative humidity with 12 h reverse light/dark cycle with feed and water provided ad libitum. After one week, five rats from the heat-stressed group were exposed to 42 °C for 120 min, while five rats from the control group were housed at an initial temperature of 22 °C. After the completion of the experiment, the animals were euthanised and samples of liver and adrenal gland tissues were used as a source of total RNA. Bioinformatic analysis. Illumina HiSeq2000 was used to sequence 150 long reads in the paired-end (PE) data mode. The total number of read pairs per sample was from 51,706,978 to 97,059,004. The detailed description of the RNA isolation and sequencing was also provided by Dou et al. 14 . The bioinformatic pipeline consisted of the following steps: quality control of raw reads, editing of raw reads based on their quality, and quantification of transcripts' expression. In particular, the quality of raw reads was assessed by applying the FastQC software 17 . Then, reads were processed by the Trimmomatic software 18 , which removed adapter sequences, trimmed reads with an average sequencing quality of 4 consecutive reads below 20 (SLIDINGWINDOW:4:20), and removed reads shorter than 60 bp (MINLEN:60). Next, the Salmon software 19 was used to quantify the abundances of transcripts. This software implements the pseudoalignment process-an approach allowing for rapid identification of the compatibility of reads with transcripts, without the need of a computationally intensive whole genome alignment. In the last step, log 2 fold changes in transcript expression levels between the control and heat stressed groups were calculated using DESeq2 20 .
Statistical modelling of expression data. The log 2 fold changes (log 2 FC) calculated based on the transcript expression levels pooled over the control and heat-stressed animals respectively, were analysed in four mixed linear models.
The transcript-based model (M1) is given by: where y is the vector of log 2 FC of transcript expression, µ represents the general mean, t is the random transcript effect with a predisposed normal distribution defined by N 0, V M1 σ 2 t , e M1 is a vector of residuals distributed as N 0, Iσ 2 e M1 , Z M1 is an incidence matrix for t . In this model, the similarity between transcripts i and j, was introduced into the model by incorporating a nondiagonal transcript covariance matrix V M1 . The covariance between transcripts was expressed by the Jaccard similarity coefficient: J(A, B) = |A∩B| |A∪B| , which in the case of transcripts was calculated based on the similarity in their exon composition: where a represents the number of exons common between transcripts i and j, while N represents the total number of exons of a given gene. Independence was assumed between genes, so the resulting matrix had a block diagonal structure. Transcript information was obtained from the Ensembl database, release100 21 and Jaccard coefficients were calculated using the ADE-4 package 22 .
The gene-based model (M2) was applied to the same dependent variable ( y ) as M1 and is given by: where g is the random gene effect with a preimposed normal distribution defined by N 0, V M2 σ 2 g , e M2 is a vector of random residuals distributed as N 0, Iσ 2 e M2 , Z M2 is an incidence matrix for g . No covariance between genes was assumed so V M2 was diagonal.
Furthermore, the Gene Ontology-based model (M3): where go is the random effect of Gene Ontology (GO; geneontology.org) terms assigned to transcripts whose log 2 FC of transcript expressions are contained in y . It was assumed that go follows the normal distribution N 0, V M3 σ 2 go , e M3 is a vector of residuals distributed as N 0, Iσ 2 e M3 , and Z M3 is an incidence matrix for the go terms. V M3 describes the covariance between GO terms expressed, as above, by the Jaccard coefficients quantifying genes overlapping between two given GO terms. Each transcript was assigned GO term(s) from the biological process ontology, considering the ontologies from the 2nd hierarchy level. Finally, a model incorporating the effects of Reactome pathways (M4) was fitted to transcript log 2 FC: in this model, r represents the random effect of Reactome pathways (reactome.org) corresponding to transcripts. The normal distribution pre-imposed on r is given by N 0, V M4 σ 2 r , e M4 is a vector of residuals distributed as N 0, Iσ 2 e M4 , and Z M4 is an incidence matrix assigning Reactome pathways to transcripts. The covariance matrix between Reactome pathways ( V M4 ) was also expressed by the Jaccard coefficients quantifying genes overlapping between the two given pathways.
Estimation of variance components and significance testing. The expectation-maximization algorithm 23 was applied for the estimation of variance components underlying the four above models (i.e. ) and the mixed model equations 24 were used to obtain solutions of µ, t, g, go and r: with x representing t, g, go or r depending on the model, and i ∈ {M1, M2, M3, M4} being a model indicator. Each element of the solution vectors t, g, go and r was transformed to the standard normal distribution and tested for significance by assessing the probability of obtaining a more extreme value based on the standard normal density function.
Enhancement of the computational efficiency of the estimation. In order to maximise the computational performance of the estimation of model parameters and its variance components, a custom-written Python program implementing the Numba library 25 was used. Numba compiles a subset of native Python and NumPy code into the machine code. Since all calculations were carried out on a multicore server, the Numba library was also used to parallelize the code, what further improved the computing time compared to a native Python application.
Ethics approval. Not applicable, only in silico data was processed in this study.

Results
Obviously models M1 and M2, fitting thousands of components, i.e. transcripts (M1) or genes (M2) explain a much larger proportion of the observed variability of log 2 FC than the M3 and M4 fitting the functional components, i.e. GO terms (M3) or Reactome pathways (M4). Still, by taking into account the correlation between transcripts expressed by their exon composition, we explain 12.43% and 18.32% of the total variability of log 2 FC observed in liver and adrenal tissues, respectively. A similar picture arises when the assumed source of variability of log 2 FC is considered on a gene level, resulting in 10.12% and 16.45% variance components in liver and adrenal tissues respectively-somewhat lower than for correlated transcripts in M1. By further shrinking the functional units to GO terms and Reactome pathway units we explain "only" 1.48% (GO) and 1.45% (Reactome) of the total variance for liver, as well as 1.44% (GO) and 1.93% (Reactome) for the adrenal tissue (Table 1).
Considering the transcript and gene levels, in the liver, M1 and M2 point at the significant effects of the PNKD and TRIP12 genes. PNKD exhibits a significant effect of a transcript ENSRNOT00000046229, which is approximately 2 23 times higher expressed in the control than in the heat-stress group. However, three of the five transcripts of the gene show higher expression in the heat-stress group, reaching 2 5 higher expression of ENSRNOT00000089580. TRIP12 synthesises one particular transcript (ENSRNOT00000022822) which is 2 22 times higher expressed in the control than in the heat-stress group. Different genes were significantly differentially expressed in the adrenal tissue. One transcript of the SUCO gene (ENSRNOG00000026542) is 2 23 times higher Table 1. Variance components estimated by the mixed models (M1-M4) expressed as the percentage of the total variance of y. σ 2 t the transcript variance, σ 2 g the gene variance, σ 2 go the GO term variance, σ 2 r the Reactome pathway variance, σ 2 y the variance of log 2 FC.   (Table 2). While moving to a functional level, we did not identify any significant GO terms differentially expressed in the adrenal tissue and a single Generic Transcription Reactome Pathway (R-RNO-212436). However, in the liver, five ontologies and one Reactome pathway were significant. Functionally, they can be grouped into ontologies

Discussion
Various scopes of information considered in our study explain different amounts of the observed variation of transcript differential expressions across the genome ( Table 1). The difference in the estimated variance components related to transcripts, genes, GO terms and Reactome pathways associates with the number of levels of those independent variables. Since we are not aware of the corresponding estimates reported by other studies, a discussion with the literature is not possible. Therefore, at this stage, we can only hypothesise that on the one Table 2. Top 5 significant transcripts from M1, top 2 significant genes from M2, significant Gene Ontology terms from M3, and significant Reactome pathways from M4. ID the Ensembl transcript ID, Ensembl gene ID Gene Ontology database ID, or Reactome ID, depending on the effect considered, Effect represents the estimate from the corresponding mixed model M1-M4, P P value corresponding to the normal probability density function with mean zero and the variance estimated by the corresponding model. www.nature.com/scientificreports/ hand, the differences in variance components can be due to the technical nature of the model, since many more transcript effects were estimated as Reactome pathway effects. On the other hand, those effects were modelled as random and thus their variation was constrained by the predefined co-variance structure so that the differences in variance components express the biological phenomenon-Reactome pathways aggregate effects of many transcript/genes. By considering the significant transcripts, genes and pathways identified in our study, a consistent picture emerges that DNA stability and its repair mechanisms are affected by heat stress. This phenomenon has already been reported by other authors, e.g. 26 and was reviewed by Kantidze et al. 27 . On a gene level, TRIP12 was reported as encoding a protein which is associated with chromatin and plays a role in the maintenance of genome integrity 28 and PER3 influences DNA damage repair by correlating with checkpoint kinase 2 gene 29 . Mutations in the SUCO gene induce hypoglycaemia [www. ensem bl. org] in humans, which is the cause of hypothermia in diabetic patients 30 . PER3 was differentially expressed in broiler chicken maintained under control and heat stress conditions 31 . In general, it has been proven that PER3 is associated with behavioural differences towards stress and was influenced by stress and ethanol treatment in BXD strain mice 32 .
On the functional level, we observed a significantly under expression of genes with ontologies related to double-strand break repair (GO:2000780 and GO:2000779) and histone ubiquitination. Furthermore, outside of the nucleus, a significant effect of the cytoplasmic translation ontology (GO:0002181) and rRNA processing in the nucleolus and cytosol pathway (R-RNO-6791226) can be linked to the phenomenon of the aggregation of proteins in the cytoplasm of yeast cells subjected to heat stress, which consequence was impaired cytoplasmic translation 33 .
Noteworthy, our list of significant transcripts, genes, GO terms, and Reactome pathways did not overlap with the results reported for the same material by Dou et al. 12 . Only the PER3 gene was significantly associated with differential expression in adrenal tissue in both studies. The most emerging difference concerned the number of significant effects reported. While Dou et al. 12 estimated 3909 and 4953 significantly differentially expressed genes for liver and adrenal tissues respectively, our study pointed at only two significant genes for each tissue. Similarly, the number of significant GO terms was 193 for liver and 79 for adrenal tissue in Dou et al. 12 , while only five and zero in our results. The observed differences are caused by the following factors: different raw reads editing criteria, different approaches to estimate transcript expression, and different statistical modelling of expression data. Dou et al. 12 used the Cufflinks software 34 for bioinformatic processing of the expression data, which implements Tophat2 35 for the alignment of reads to the reference genome and a single gene hypothesis test with multiple testing correction of P values via the FDR for the assessment of differential expression. In our analysis, an alignment-free approach and a random effect model incorporating all transcripts/genes simultaneously were applied. Already 36 pointed at differences in the statistical inference based on fixed and random effect models, indicating that the former tend to underestimate, and the latter-overestimate a residual variance. As a consequence, a type I-error is often elevated in fixed-effect models, which was analytically demonstrated by 37 in the context of meta-analysis. Moreover, significant differences in the quantification of expression between alignment-based and alignment-free approaches were recently demonstrated in our unpublished analysis of Sus scrofa RNA-seq data by Hoffman et al.

Conclusions
Mixed models, i.e. statistical models fitting random effects, are a valuable tool for the analysis of high-throughput biological data. Their major advantages comprise: (1) the possibility to incorporate information on covariance between observations, which is often neglected while applying simple, fixed effect models, and (2) circumventing the problem of multiple testing, by simultaneous fitting all effects. We see the major limitation of the proposed approach in the varying quality of functional annotation of genomes available for different species. While genomes of humans and experimental species, such as rats, are very well functionally annotated with the most of transcripts/genes assigned to GO terms and metabolic pathways, less well-studied species have less complete annotation, which would enforce the incorporation of phantom GO and phantom pathway effects (similarly to phantom parent groups in livestock genetic evaluation).
From the biological perspective, PER3 and SUCO genes as well as DNA repair and translation were indicated as factors playing a significant role in heat stress response.

Data availability
The datasets generated and/or analysed during the current study are not publicly available due to institutional constrains, but are available from the Yachun Wang on reasonable request.