Differential gene expression analysis based on linear mixed model corrects false positive inflation for studying quantitative traits

Differential gene expression (DGE) analysis has been widely employed to identify genes expressed differentially with respect to a trait of interest using RNA sequencing (RNA-Seq) data. Recent RNA-Seq data with large samples pose challenges to existing DGE methods, which were mainly developed for dichotomous traits and small sample sizes. Especially, existing DGE methods are likely to result in inflated false positive rates. To address this gap, we employed a linear mixed model (LMM) that has been widely used in genetic association studies for DGE analysis of quantitative traits. We first applied the LMM method to the discovery RNA-Seq data of dorsolateral prefrontal cortex (DLPFC) tissue (n = 632) with four continuous measures of Alzheimer’s Disease (AD) cognitive and neuropathologic traits. The quantile–quantile plots of p-values showed that false positive rates were well calibrated by LMM, whereas other methods not accounting for sample-specific mixed effects led to serious inflation. LMM identified 37 potentially significant genes with differential expression in DLPFC for at least one of the AD traits, 17 of which were replicated in the additional RNA-Seq data of DLPFC, supplemental motor area, spinal cord, and muscle tissues. This application study showed not only well calibrated DGE results by LMM, but also possibly shared gene regulatory mechanisms of AD traits across different relevant tissues.


DGE in the discovery data of DLPFC tissue
We compared the results obtained from the DGE analyses of continuous cognitive decline and three AD-NC traits, by using four methods: LMM, standard linear regression model, robust regression, and Voom with quantitative traits dichotomized by their medians (see Methods).Our DGE analyses adjusted for various covariates, including sex, age, postmortem interval (PMI, time span between donor's death and tissue harvest) and study group (ROS or MAP) in both models (Table 1).We constructed Quantile-Quantile plots (QQ-plots) to visualize the DGE p-values of all test genes per trait, and calculated the genomic control factor . 26 As depicted in Fig. 1, LMM-based test method well calibrated false positive rates (with ∼ 1 ) for all traits (First Row in Fig. 1) with the discovery RNA-seq data of DLPFC tissue, while the standard linear regression method resulted in seriously inflated false positive rates associated (with > 5 ) for all four traits (Second Row in Fig. 1).High inflated false positive rates were also observed in the results of all four traits with the discovery data by using the robust regression method (First Row in Supplemental Fig. 1) with > 3 , and by using the Voom method (First Row in Supplemental Fig. 2) with > 2.
We identified a total of 37 potential statistically significant genes with differential expression (p-values < 0.0001 for at least one trait) by the LMM-based test method, including 2 for cognitive decline, 4 for β-amyloid associated genes, 2 for tangle density, and 4 for global AD pathology burden, with p-values less than the Bonferroni corrected significance threshold of 3.49 × 10 -6 (Table 2; Figs. 2, 3, 4).
Importantly, many of the potential significant genes associated with cognitive decline and AD-NC traits identified by the LMM method have been reported in prior studies.This confluence of findings bolsters the credibility of our outcomes.Notably, for example, gene MEIS3 (P-value = 1.16 × 10 -8 for cognitive decline) and gene NPNT (p-value = 1.49× 10 -6 for tangle density) have previously identified as differentially expressed genes in the context of AD in earlier studies 27,28 .Likewise, Gene DDAH2 (p-value = 4.18 × 10 -7 for global AD pathology

DGE in the replication RNA-Seq datasets
We applied the LMM, standard linear regression, robust regression, and Voom methods to conduct DGE analyses of the same cognitive decline and AD-NC traits, using additional replication RNA-Seq data of DLPFC (n = 588), SMA (n = 234), spinal cord (n = 232), and muscle (n = 268) tissues (Supplemental Tables 1-4).Confounding covariates such as sex, age, and postmortem interval were adjusted for in all analyses.Study group (ROS or MAP) was only adjusted in DLPFC datasets as participants of SMA, spinal cord, and muscle tissues are all from MAP. QQ-plots (Supplemental Figs.1-8) still showed that LMM-based test results with these validation datasets were better calibrated than those obtained by standard linear regression, robust regression, and Voom, especially for studying the validation data of DLPFC (Supplemental Figs.1-3).Thus, we only present the validation results obtained by LMM method here.
With validation RNA-Seq data of DLPFC, we replicated 10 of these 37 potential significant genes identified in the discovery analyses with validation p-values<1.35× 10 −3 for either cognitive decline or at least one of the AD-NC traits (Table 3).For example, PLCD3 differentially expressed for β-Amyloid and global AD pathology was replicated with P-value = 5.42 × 10 -5 for global AD pathology; TRIP6 differentially expressed for global AD pathology was replicated with p-value = 6.34 × 10 -4 for global AD pathology; PLCE1 differentially expressed for β-Amyloid, tangle density, and global AD pathology was replicated with p-value = 4.51 × 10 -4 for global AD pathology.
Since the validation RNA-Seq datasets of the motor function related tissues (SMA, spinal cord, muscle) have sample sizes of only ~ 100 (Supplemental Table 3), we used a more liberal p-value threshold (nominal p-value < 0.05 for either cognitive decline or at least one of the AD-NC traits) to identify replicated genes.As a result, for the replicated differentially expressed genes in the CNS tissues, we found 8 in SMA with 5 overlapped in DLPFC, 2 in spinal cord, 3 in muscle with one overlapped in spinal cord, and one overlapped in SMA (Table 4).For example, ALDH6A1 differentially expressed for global AD pathology in the discovery data was replicated with P-value = 0.008 for cognitive decline in SMA and muscle.Additionally, HRSP12, a differentially expressed gene related to cognitive decline in the discovery analyses was replicated with significant p-values < 0.0001 in SMA.Interestingly, several differentially expressed genes including ADAMTS2 for cognitive decline, NPNT for all four traits, RERG for β-Amyloid and global AD pathology, as well as MEIS3 for cognitive decline and tangle density that were identified in the discovery data were replicated in the validation datasets of DLPFC and CNS tissues.It is noteworthy that replicated gene ADAMTS2 in both DLPFC and SMA tissues was suggested to be a therapeutic target for AD 32 , while replicated gene HRSP12 in SMA is also known by its alias RIDA, which was found as a GWAS risk loci for blood protein levels by previous studies 33 .
To further illustrate the reason why differentially expressed genes in the DLPFC tissue could be replicated in the SMA, spinal cord, and muscle tissues, we created correlation heatmaps of the gene expression levels of these validated genes.For each trait, we sorted samples based on their trait values and divided sorted samples www.nature.com/scientificreports/into ten equal parts (i.e., decile).For each replicated gene, we calculated the average gene expression level of the discovery DLPFC and replication tissues, for samples in each decile of the discovery and replicated traits.
Then we calculated the correlations between these two vectors of average gene expression.Heatmaps of these correlations (Supplemental Figs.9,10) show that most correlations are > 0.15, demonstrating that these replicated differentially expressed genes are not tissue specific, but likely to be shared across these motor function related tissues.For example, differentially expressed genes ADAMTS2 and HRSP12 that were replicated with all four traits in SMA all have gene expression correlations > 0.15.
In conclusion, our analysis revealed that all 5 differentially expressed genes (NPNT, ALDH6A1, RERG, PLCD3, MEIS3) with Bonferroni corrected significant p-values in the discovery data were replicated in the validation data of DLPFC tissue.Furthermore, we validated a total of 17 unique genes across the validation RNA-Seq data of DLPFC and three motor function related tissues, including 10 in DLPFC, 8 in SMA, 2 in spinal cord, and 3 in muscle.The identification of shared differentially expressed genes in two different tissue types, such as DLPFC and SMA, DLPFC and non-neural muscle, as well as spinal cord and non-neural muscle outside the brain, suggests a possible shared molecular mechanism between motor and cognition functions.
Table 2. LMM P-values of 37 potential DGEs (p-values < 0.0001) identified by LMM using the discovery ROS/MAP RNA-Seq data of DLPFC, for at least one trait of the cognitive decline and three AD pathologies.a Significant DGEs with Bonferroni correction (p-value < 3.49 × 10 -6 ).*Indicating the corresponding trait (columns 3-6) for which the potential DGE was identified (P-values < 0.0001).

Pathway enrichment analysis
To illustrate the underlying pathways and biological functions of our identified differentially expressed genes of all 4 AD traits.We selected top 100 significantly differentially expressed genes identified by using the discovery RNA-Seq data of DLPFC tissue for each of the 4 traits to conduct pathway enrichment analyses by pathDIP 34 .Databases of NetPath 35 , Panther Pathway 36 , and Spike 37 were used in the enrichment analyses.Significant enrichment in several biological pathways were identified with the top 100 significantly differentially expressed genes of cognitive decline and tangle density (Fig. 5).These significant pathways were reported by previous studies to be relevant with AD.For example, for the pathways significantly enriched with top 100 differentially expressed genes of cognitive decline (Fig. 5A), the Thyrotropin releasing hormone (TRH) receptor signaling pathway (FDR = 3.54 × 10 -2 ) has been associated with aging and neurodegenerative diseases, such as Alzheimer's disease and Parkinson's disease 38 .Similarly, the 5HT2 type receptor mediated signaling pathway (FDR = 4.37 × 10 -2 ) could influence the www.nature.com/scientificreports/behavioral and psychological symptoms of dementia (BPSD) in Alzheimer's disease (AD) 39 .The Oxytocin receptor medicated signaling pathway (FDR = 4.37 × 10 -2 ) was suggested to be a novel protective target for vascular dementia and mixed dementias 40 .The toll-like receptor (TLR) signaling pathway (FDR = 1.42 × 10 -2 ) may be involved in clearance of amyloid β-protein (Aβ) in the brain making it a potential therapeutic target for AD 41 .
The Renin-Anigotensin System (RAS) and tumorigenesis pathway (FDR = 1.52 × 10 -2 ) is known to play a key role in interacting with pathophysiological mechanisms of AD 42 .Several evidences suggest that enhancing Wnt pathway (FDR = 1.88 × 10 -2 ) can boost synaptic function during aging, and ameliorate synaptic pathology in AD which could be novel therapeutic for restoration in the brain 43 .The Epidermal growth factor receptor (EGFR1) pathway (FDR = 2.01 × 10 -2 ), a preferred target for treating memory loss induced by amyloid-beta (Aβ) 44 , is also enriched in β-amyloid.Also, for the pathways significantly enriched with top 100 differentially expressed genes of tangle density (Fig. 5B), Death-Associated Protein Kinase 1 in DAPk family (FDR = 5.98 × 10 -3 ) that plays a critical role in deregulation in AD thus manipulating DAPK1 activity and/or expression could be a promising drug target in AD 45 .The additional protective mechanism of AndrogenReceptor (FDR = 3.92 × 10 -2 ) might enhance neural health and deter the progression of AD 46 .

Discussion
Most existing DGE analysis methods [6][7][8][9]11 are developed for dichotomous traits with small sample sizes. Howver, with increased sample sizes in RNA-Seq datasets, there is a huge demand for methods for studying quantitative traits in population-based RNA-Seq studies.As shown by the MACAU 11 method paper, incorporating a mixed term into DGE analysis can help to control for false positive rates in RNA-Seq studies.In this study, we develop an analytic pipeline for implementing the GEMMA tool 22 , enabling the DGE analysis of quantitative traits by LMM, and apply to real ROS/MAP RNA-Seq datasets of DLPFC, SMA, spinal cord, and muscle tissues for studying continuous cognitive decline and AD-NC traits.The pipeline is freely available from https:// github.com/ tangj iji19 9645/ LMM_ DGE_ Pipel ine.
Our application studies found that DGE analyses results obtained by LMM-based tests were all well calibrated for false positive rate, especially in our discovery RNA-Seq data of DLPFC, while the DGE results obtained by the alternative standard linear regression, robust regression, and Voom methods all have inflated false positive rates.A list of 37 potential differentially expressed genes were identified by LMM in the discovery data, and 17 of these were replicated in the additional RNA-seq data of DLPFC, SMA, spinal cord, and muscle tissues.AD     www.nature.com/scientificreports/relevant biological pathways were also found to be enriched with top differentially expressed genes of cognitive decline and tangle density.However, the LMM-based test still has several limitations for studying RNA-Seq data.First, the LMM assumes normal distributions for both the response variable and covariates, while the raw RNA-Seq data are read counts per gene.Even after log2 transformation for the raw RNA-Seq read counts, the transformed quantitative gene expression level per gene may not be normally distributed 47 , which could lead to a biased estimation of effect sizes.One might apply both the MACAU (mixed Poisson model-based test that is suitable for modeling RNA-Seq read counts) and our LMM analytic pipeline for DGE analysis and examine the results by QQ-plots.Second, the effect of the gene expression on the quantitative trait of interest may be heterogeneous, with effect sizes varying across the quantiles of the quantitative trait.The LMM-based method only tests the association between gene expression and quantitative trait of interest in expectation, ignoring the possible heterogeneous effects on different quantiles of the quantitative trait 48 .Therefore, further studies are needed to develop a DGE method based on quantile regression with a mixed effect term to account for the possible heterogeneous effect of gene expression across all the quantiles of the quantitative trait of interest, while controlling for false positive rates.
Overall, we provide a useful LMM pipeline for conducting DGE analysis with quantitative traits and large sample sizes, which is shown well calibrating for false positive rates in real studies.Our real application studies not only demonstrated the effectiveness of the LMM approach for DGE analysis, but also identified a list of differentially expressed genes for cognitive decline and AD-NC traits in DLPFC that were validated in DLPFC, SMA, spinal cord, and muscle tissues.Our findings have important implications for understanding the underlying biological mechanisms of the continuous AD traits of cognitive decline and neuropathologic changes, and may provide insights into the development of new therapeutic approaches for AD.

RNA-Seq data normalization
Preprocessing and normalization of raw read counts is a critical step in DGE analysis.Generally, samples with total mapped reads < 10 million are suggested to be excluded, and genes with expression levels < 0.1 transcript per million (TPM) in > 20% samples are also suggested to be excluded.We use DESeq2 6 to normalize raw RNA-Seq data.
Let K ij denote the read count for gene j and sample i, following a negative binomial distribution with mean µ ij and dispersion α j given by the following formulas: The normalization factor s ij is assumed to be shared per sample, s ij = s i , and s i is estimated by the median (across all genes) of the ratios of raw read count per gene and its corresponding geometric mean K R i (across all samples) as in the following formula 49,50 : Normalized read counts x ij are given by K ij s ij and then log2 transformed with an offset of 1 and taken as the test variable in the LMM, standard linear regression, robust regression, and Voom methods.

Standard linear regression model for DGE analysis of quantitative traits
As proposed by previous study 21 , testing if gene j with expression levels X j (normalized and log2 transformed) is differentially expressed with respect to a quantitative trait Y n×1 can be done based on the following standard linear regression : where n is the number of test samples; W is a n × c covariate matrix; α is a c × 1 vector of covariate effects includ- ing the intercept; β j is the effect size of gene j; and ǫ denotes the error term following a Multivariate Normal distribution (MVN).The DGE analysis is to test the null hypothesis of H 0 : β j = 0vs.H a : β j � = 0 , which can be conducted by using the Wald test statistic, with the maximizing likelihood estimator β j and its standard error se( β j ).

Robust regression for DGE analysis of quantitative traits
Robust regression 21 assumes the same model as the standard linear regression model ( 4), yet it furnishes robust coefficient estimates when the test samples contain influential outliers that could heavily impact standard linear (1) where k is the tuning constant (often taken as 1.345 σ ); ρ(e) is Huber's objective function; and w(e) is Huber's weighted function.The weighted least squares estimate with sample weights given by w(Y i − W i α − X i,j β j ) will be iteratively calculated given coefficient estimates from last iteration, until a stopping criterion is met.
As described in the previous study that proposed using the robust regression for DGE analysis 21 , the R packages of "rlm" and "sfsmisc" can be used to conduct the statistical test of H 0 : β j = 0vs.H a : β j � = 0 .The robust regression method is expected to provide more reliable estimates of β j when outliers are present in the test samples.

Voom for DGE analysis
Since the Voom method 10 is developed for detecting differentially expressed genes between two or more conditions, the quantitative trait Y n×1 needs to be dichotomized for testing if gene j with expression levels X j (normalized and log2 transformed read counts) is differentially expressed.Generally, the quantitative trait is dichotomized to Y d n×1 by taking the median as a cut-off.That is, if Y d i is greater than the median, it is assigned a value of 1, otherwise Y d i is assigned a value of 0. The Voom method assumes the following model: where W is a n × c matrix of confounding covariates; α is a c × 1 vector of covariate effects including an inter- cept term; β j is coefficient of gene j representing log2-fold-changes between two conditions of the dichotomous trait.The same hypothesis with H 0 : β j = 0 vs.H a : β j � = 0 is tested by Voom.Different from the ordinary least squared estimates based on (10), the Voom method robustly estimates the mean-variance relationship of the log2 transformed read counts, generates a precision weight for each sample, and enters these into the limma empirical Bayes analysis pipeline 10 .

LMM by GEMMA
To test if gene j with expression levels X j (normalized and log2 transformed read counts) is differentially expressed with respect to a quantitative trait Y n×1 with n samples in DGE analysis, the following LMM is assumed: where W is a n × c matrix of confounding covariates; α is a c × 1 vector of covariate effects including an intercept term; β j is the effect size of gene j; Z is a n × n loading matrix which is taken as an identify matrix for DGE analy- sis; ε is a n × 1 vector of independent errors following a normal distribution with mean 0 and variance τ −1 ; u is a n × 1 vector denoting random effects of all samples following a Multivariate Normal distribution with mean 0 and variance-covariance matrix γ τ −1 M ; γ is the ratio of variance components between random effects and errors; M is a n × n sample-sample correlation matrix with all gene expressions; and I n is an n × n identity matrix.GEMMA tool 22 is a C++ programed software that can be used to conduct tens of thousands of Wald test for H 0 : β j = 0(j = 1, . . ., p) .Under the above LMM, GEMMA efficiently calculates the REstricted Maximum Likelihood (REML) estimates of γ , β j , and the standard error of β j .Although GEMMA is originally developed for genome-wide association study to test the association between a genetic variant and a quantitative trait, it can be applied to DGE if both genetic variant and log2 transformed gene expression follow normal distributions.
We demonstrate the feasibility and effectiveness of conducting DGE analyses using the LMM method through application studies with the ROS/MAP data 17 .Our analyses were conducted in two steps.First, we normalized raw RNA-Seq data using DESeq2 6 .Second, we tested DGE using the LMM method as implemented in the GEMMA tool 22 .

ROS/MAP data
The Religious Order Study (ROS) and the Rush Memory and Aging Project (MAP) are two prospective community-based harmonized cohort studies of aging, which recruit senior individuals without known dementia at study entry 52 .All ROSMAP participants agree to structured annual clinical testing and autopsy and brain donation upon their death.RNA-Seq data (DLPFC, SMA, spinal cord, and muscle tissues) and AD pathologies were profiled from decedents.Both studies were approved by the Institutional Review Board of Rush University Medical Center, and all participants signed informed and repository consents and an Anatomic Gift Act.

Figure 2 .
Figure 2. Volcano plots of DGE results by LMM results by LMM with the discovery RNA-Seq data of DLPFC tissue of cognitive decline (A), tangle density (B), β-amyloid (C), and global AD pathology burden (D).Genes with effect size beta > 0.05 or < − 0.05 (vertical red lines) and p-values < 0.05 (horizontal red line) were colored.Blue points were down regulated genes and red points were up regulated genes.Top five significant up and down regulated genes were labeled.

Figure 3 .
Figure 3. Manhattan plots of DGE p-values by LMM with the discovery RNA-Seq data of DLPFC tissue of cognitive decline (A) and tangle density (B).Top five significant DGEs were labeled.Red line indicates the significant threshold 3.49 × 10 -6 with Bonferroni correction and blue line indicates p-value = 0.0001.

Figure 4 .
Figure 4. Manhattan plots of DGE p-values by LMM with the discovery RNA-Seq data of DLPFC tissue of β -amyloid (A) and global AD pathology burden (B).Top five significant DGEs were labeled.Red line indicates the significant threshold 3.49 × 10 -6 with Bonferroni correction and blue line indicates p-value = 0.0001.

Table 4 .
LMM P-values of 11 replicated DGEs (p-value < 0.05) using the validation ROS/MAP RNA-Seq data of SMA, spinal cord and muscle tissues.a Significantly replicated DGEs with p-value < 0.001.*Indicating the corresponding trait (columns 4-7) for which the potential DGE was replicated (p-value < 0.05).

Figure 5 .
Figure 5. Significant pathways with FDR < 0.05 that are enriched with top 100 differentially expressed genes of cognitive decline (A) and tangle density (B) with the discovery RNA-Seq data of DLPFC tissue.

Table 1 .
Clinical and postmortem characteristics of the discovery analytic cohort.
51 under H 0 , regression estimates.Different from the standard linear regression method where each sample contributes equally to the ordinary least squared estimation of regression coefficients, robust regression incorporates Huber's M-estimation51that is obtained by minimizing the following objection function through a numerical method called iteratively reweighted least squares (IRLS):