Performance Gains in Genome-Wide Association Studies for Longitudinal Traits via Modeling Time-varied effects

Complex traits with multiple phenotypic values changing over time are called longitudinal traits. In traditional genome-wide association studies (GWAS) for longitudinal traits, a combined/averaged estimated breeding value (EBV) or deregressed proof (DRP) instead of multiple phenotypic measurements per se for each individual was frequently treated as response variable in statistical model. This can result in power losses or even inflate false positive rates (FPRs) in the detection due to failure of exploring time-dependent relationship among measurements. Aiming at overcoming such limitation, we developed two random regression-based models for functional GWAS on longitudinal traits, which could directly use original time-dependent records as response variable and fit the time-varied Quantitative Trait Nucleotide (QTN) effect. Simulation studies showed that our methods could control the FPRs and increase statistical powers in detecting QTN in comparison with traditional methods where EBVs, DRPs or estimated residuals were considered as response variables. Besides, our proposed models also achieved reliable powers in gene detection when implementing into two real datasets, a Chinese Holstein Cattle data and the Genetic Analysis Workshop 18 data. Our study herein offers an optimal way to enhance the power of gene detection and further understand genetic control of developmental processes for complex longitudinal traits.

increase of QTN heritability. Especially when the QTN heritability reached 2%, the powers were very close to 100% for all methods.
As expected, our proposed models (fGWAS-C and fGWAS-F) achieved the highest power among all the models employed under all scenarios except when = . h 0 1% QTN 2 , where the GWAS-EBV-NP and GWAS-DRP-NP models achieved a higher power at the cost of high FPRs. It should be pointed out that the GWAS-Residual model also obtained relatively higher power even under the relatively conservative circumstance. For our two models, the model fGWAS-C achieved more power than fGWAS-F. The advantage of fGWAS-F was that it could test additive and dominant effect simultaneously. But it could lose power for testing the additive effect in some degree. Interestingly, we found that the GWAS-EBV-P gained more power than GWAS-DRP-P in our simulation.
We also evaluated powers of all models using empirical thresholds based on different FPRs for null-effect SNP. The receiver operating characteristic (ROC) curves plotting the statistical powers against FPRs were shown in Figures S1 and S2. The curves indicated that the fGWAS-C model performed best at all levels of QTN heritabilities, and fGWAS-F model was the second best except it achieved a slightly lower power than GWAS-Residual model at a lower QTN heritability (0.1%).
Furthermore, we discovered that the p-values (− p log ( ) 10 ) between our two proposed models as well as the p-values for the GWAS-EBV-P and GWAS-DRP-P models were strongly correlated (r > 0.9) (Fig. 3). This indicated that these two pairs of models could lead to similar orders of p values, respectively.
Estimation accuracy of functional QTN effect. The average of estimated cumulative additive effect (see equation S3 in Supplementary Methods for calculation) or estimated cumulative effect of the QTN for different models and their corresponding standard deviations (SD) and root-mean-square errors (RMSE) were summarized in Table 1. As no dominant effect was simulated for the causal QTN, the cumulative dominant effect predicted by fGWAS-F was very close to zero (Table S1). In the simulation, the true cumulative QTN effect was fixed at 175. 21. It could be seen that the fGWAS-C and fGWAS-F models achieved the most accurate estimate of the QTN effect regardless of QTN heritability, while the other models always underestimated the true QTN effect in different degree. Meanwhile, the standard deviations of the cumulative effect estimated by all the models decreased as the QTN heritability increased except the GWAS-Residual model, which implied that a more accurate estimation of QTN effect could be realized at a higher QTN heritability. The root-mean-square-errors of our two proposed models were always the smallest across all models for each QTN heritability scenario, and they were very close to the corresponding standard deviations for these two models. Furthermore, the average additive genetic effect curves across the 1,000 replicates predicted by the fGWAS-C and fGWAS-F models shared perfect concordance with the true curves (Fig. 4).
Chinese Holstein cattle data. We used Akaike information criterion 34 (AIC) as well as Bayesian Information Criterion 35 (BIC) to determine the orders of basis functions. After model selection with AIC and BIC values, the model with a fifth-order basis functions for population mean, a third-order for additive genetic effects and a fifth-order for permanent environmental effects was best fit to the data for all the three traits (Table S2). Manhattan plots of − p log ( ) 10 for milk yield (MY), fat percentage (FP) and protein percentage (PP) by the fGWAS-C and fGWAS-F models were shown in Fig. 5. For the three traits of Chinese Holstein cattle population, we found 215 genome-wise significant SNPs in total by our fGWAS-C and fGWAS-F models ( Figure S3A). Among the 215 SNPs, 179 were commonly detected by both methods, while 33 and three were solely detected by fGWAS-C and fGWAS-F, respectively ( Figure S3B). The results indicated that fGWAS-C and fGWAS-F shared perfect concordance, and fGWAS-F could lose power in some degree. Furthermore, 11 of these 215 SNPs, located in a relatively narrow segment (from 1.65 to 4.36 MB) of chromosome 14, were discovered to affect all the three traits. The well-known DGAT1 (diacylglycerol O-acyltransferase 1) gene, reported to be a major gene affecting milk production traits 36 , is located within this region.
For milk yield, 3 of 17 significant SNPs were located on chromosome 5, 12 and 15, respectively, while all the remaining SNPs were located between 1.48 and 4.36 MB of chromosome 14. For fat percentage, 126 significant  10 ) using different GWAS models at QTN heritability of 1.0% for the simulation study. Scatterplots of − p log ( ) 10 for any two GWAS models were shown at the upper triangular, with Pearson correlation coefficients listed at the lower triangular. The read lines represented regression line y = x; the blue lines were the lines of best fit for − p log ( ) 10 of each two models.     Tables S3  through S5. The top significant SNP for the three traits was SNP ARS-BFGL-NGS-4939, which was located within DGAT1 gene region. This SNP explained 1.45%, 13.72% and 1.93% of the phenotypic variation for milk yield, fat percentage and protein percentage with the fGWAS-F model, respectively. The curves of additive effects, dominance effects and QTL heritabilities of this SNP for three traits were shown in Figure S4. GAW18 data. As higher order basis functions did not converge, the model with a second-order basis functions for all the time-varied effects was used to fit GAW18 data. Manhattan plots of p values for two traits by the fGWAS-F model were shown in Fig. 6. For systolic blood pressure, two SNPs (on Chr13) reached the genome-wide significance level. Both of them are located within the region of gene CUL4A, which participates in the biological processes including nucleotide-excision repair, DNA damage recognition and regulation of DNA damage checkpoint. For diastolic blood pressure, 6 SNPs showed the genome-wide significance. The nearest genes to these 6 SNPs are CDC42 (within), TMEM248 (within), RN7SL43P (782 bp away), VAV2 (within), UFM1 (53 kb away), and AP000959.2 (1.47 Mb away), respectively. Interestingly, both CDC42 and VAV2 genes participate in the biological process of blood coagulation, and CDC42 gene also participates in heart contraction.

Discussion
Recently, a growing number of studies indicated that the expression of genes was time-dependent [37][38][39] . In current study, we proposed two models for the GWAS of longitudinal trait which could fit the time-varied QTN effects and directly use the raw longitudinal records. This can fully avoid the necessity of transforming phenotypes into pseudo-phenotypes, such as EBVs 20 , DRPs 40 , or estimated residuals. The simulation results indicated that our proposed models could capture genetic differences varied in the entire process of the time period, thereby increasing the statistical power of QTN detection. Although pseudo-phenotypes were substitutions for longitudinal records, the scales of them would be changed 41 . Therefore, the QTN effects predicted by these pseudo-phenotypes methods were biased. This might not influence the significance test, as the scales of corresponding estimated errors would also change. However, the pseudo-phenotypes methods could not directly predict the true proportions of the phenotypic variance explained by QTNs. As our fGWAS-C and fGWAS-F models directly used raw phenotypes and achieved the most accurate estimate of the QTN effects, they could be used to predict QTN heritability in practice. Overall, the proposed random regression-based methods clearly outperformed other traditional methods validated by extensive simulations.
Among the traditional GWAS models, while no polygenetic effects were fitted to account for cryptic relationships between individuals, the GWAS-EBV-NP and GWAS-DRP-NP models resulted in high FPRs. DRPs had adjusted for parental average effect 23 . However, the cryptic relationships between individuals still existed when the EBVs were estimated from repeated measurements 22 . In the simulation study, our results indicated that GWAS-EBV-P and GWAS-DRP-P models had similar performance in controlling the FPRs. The GWAS-DRP-P models lowered the power in some degree in our simulation. This may be because the cryptic relationships were corrected twice. One is removing the parental average information from EBVs, the other is including the polygenetic effects into the GWAS model. The EBVs or DRPs had removed the environmental effects and combined the repeated phenotypic values into a single one for each individual, which resulted in a much smaller dimension of the mixed model equation, and thus was more computationally efficient and feasible. Therefore, the GWAS-EBV-P and GWAS-DRP-P models are still an appealing alternative for GWAS study of longitudinal data when the computational efficiency is the primary consideration.
Meanwhile, we applied our two models to a Chinese Holstein cattle data and the fGWAS-F model to the GAW18 data. The GWAS for milk production traits of Chinese Holstein population had been implemented by Jiang et al. with method similar to GWAS-EBV-P 20 . Here, we expanded the population size from 2,093 to 6,619. Furthermore, the Animal QTLdb had collected 4,585 QTLs (including the QTLs obtained from association analysis) for MY, FP and PP of dairy cattle since 1994, which made our study population be a suitable dataset for validating GWAS approaches for longitudinal traits. We mapped our significant SNPs to the Animal QTLdb and discovered that most of them (165/215) located within the reported QTL regions. Moreover, two novel regions (Chr5:93.49-94.65 MB for FP; Chr20: 44.16-45.87 MB for PP) contained several significant SNPs (15 SNPs for the former and 7 SNPs for the later) in relatively narrow segments, and they were potential candidate QTLs regions for milk production traits. The estimated curves of additive effects, dominance effects and QTN heritabilities of SNP ARS-BFGL-NGS-4939 (within the DGAT1 region) for milk yield, fat percentage, and protein percentage were also predicted by our fGWAS-F model. The trend of these estimated curves implied that the genetic effects were not constant, and could depend on the data or environment.
For GAW18 dataset, Chen et al. did not find any significant SNPs using admixture mapping analysis 42 , and Chung and Zou found four significant SNPs with extended EMMA 1 model 26 . These studies indicated that human blood pressure might have a complicated genetic background, and there might be no major genomic region affecting it. In our studies, totally eight significant SNPs were discovered by our fGWAS-F model, and their nearest genes participated in the biological processes of nucleotide-excision repair, blood coagulation, heart contraction and so on, which closely related to heart disease. These eight significant SNPs could be candidate associated loci for blood pressure.
Functional GWAS is not a novel conception, and has been proposed and carried out by Das et al. 43 . One of the key differences between their model and our proposed ones is that we divided the time-varied mean values for SNP genotypes into two parts, time-dependent population mean and SNP effects, instead of fitting them directly. In this way, our models can be easily implemented by the popular ASReml software. The more important difference is that time-varied polygenetic effects are fitted in our models to control the FPRs. We applied the fGWAS software by Das et al. to our simulated data. The resulted FPRs at tabulated thresholds of p value = 0.01 and 0.05 were 0.046 and 0.125, respectively. As indicated by Xu 44 in GWAS of non-longitudinal traits, the model ignoring polygenic covariance structure merely performed well for the simple experiment with one QTN. However, the signals became very noisy for complex experiment with multiple QTNs. In practical situation, quantitative traits of interest are controlled by more than one QTN 45 . Therefore, it should be beneficial to include polygenic effect in the model.
The emerging next-generation sequencing technology impels us to find the "miss heritability" of complex traits. Along with the technological evolution, the availability of public data, such as the 1000 Genomes Project and 1000 Bull Genomes Project, provides opportunities to maximize the value of our existing data though genotype imputation. The number of variants can be increased and true QTNs may be located in this way. The genotype dosages (a continuous random variable between 0 and 2) can gain more powers than "best-guess" imputed genotype (genotypes with the highest probability) in GWAS 46 . Luckily, our fGWAS-C model can also be applied to genotype dosages. This cannot be achieved by the fGWAS software by Das et al.
As expected, our proposed fGWAS-C and fGWAS-F models showed obvious computational inefficiency as the dimension of the mixed model equation was larger than other models. When the relationship matrix is established based on pedigree, the computational burden is less challenging as the numerator relationship matrix is relatively sparse. The marker-based kinship matrix can reflect the relationship between individuals more precisely. For example, the relationship among full siblings will be the same based on pedigree, but can be distinguishable with genetic markers 47 . However, a dense marker-based kinship matrix will increase the computational burden heavily. Zhang et al. suggested that a compression approach, which was called compressed mixed linear model, would decrease the effective sample size by clustering individuals into groups 3 . Meanwhile, for population with unknown degree of genetic relationship, Kang et al. developed a procedure for estimating the contribution of the polygenetic effects to the phenotypes and the polygenetic effects were not needed to fit in the GWAS model if they were tested non-significant 2 . Both approaches can be incorporated to improve our proposed models in our future endeavors.
In conclusion, we proposed two models fGWAS-C and fGWAS-F using random regression for functional GWAS of longitudinal traits on a genome-wide scale. According to our simulation study results, the proposed models fitted longitudinal traits successfully and outperformed the models using EBVs, DRPs or estimated residuals as response variables. Using our proposed models, we have successfully found two novel regions which were significantly related with milking production traits for the Chinese Holstein data and some SNPs related with blood pressure for the GWA18 workshop dataset. Generally, functional GWAS models using random regression were useful and appealing in the GWAS for longitudinal traits.

Materials and Methods
General expression of the random regression model. The general expression of random regression model can be formulated as the time-dependent function: Scientific RepoRts | 7: 590 | DOI:10.1038/s41598-017-00638-2  Here, A is the numerator relationship matrix based on pedigree information; I is the identity matrix; ⊗ is the Kronecker product; G is the variance-covariance matrix for random regression coefficients of additive polygenic effects with size of (nr 1 + 1) × (nr 1 + 1); P is the variance-covariance matrix of random regression coefficients for permanent environmental effects with size of (nr 2 + 1) × (nr 2 + 1); σ e 2 is the residual variance. Therefore, the mixed model equations can be expressed as: Statistical models for the GWAS of longitudinal data. Under the framework of the random regression model, we proposed two detection methods for the association analysis of longitudinal traits, i.e., functional GWAS model treating each SNP as the covariate (fGWAS-C), and functional GWAS model treating each SNP as the factor (fGWAS-F). To exploit the property of these two novel methods, several alternative models/strategies which used the EBVs, DRPs, or estimated residuals as the response variables for GWAS of longitudinal traits were also applied for extensive comparisons. Details of each model were specified below as well as listed in Table 2.
Scientific RepoRts | 7: 590 | DOI:10.1038/s41598-017-00638-2 Here, x i is a genotype indicator which is assigned 0, 1 and 2 for genotype aa, Aa and AA, respectively; SNP(t) represents the time-varied additive effect for each marker and can be expressed as linear regression for a set of basis functions as mentioned before: is the value of the kth basis function at time t; η k is the kth fixed regression coefficient for additive SNP effect; nf is the order of basis functions for the time-varied SNP effect. For convenience, we define the same order of time-varied population mean and SNP effect in this model.
Similarly, the fGWAS-F model is formulated as: Here, SNP l (t) means time-varied effect for genotype l (AA, Aa and aa) of each marker and λ lk is the kth fixed regression coefficient for genotype l. For fGWAS-F model, time-varied additive genetic effect, dominance genetic effect, and additive genetic variance of each SNP can be deduced as 45 : where p and q are the allele frequencies for each locus.

GWAS-EBV-P and GWAS-EBV-NP models.
Under such models, combined EBVs are firstly deduced through solving Equation (4). Therefore, the estimated additive genetic curve for individual i can be formulated as where t min and t max are the first and last recording time of all individuals, then the accumulated EBV for each individual from t min to t max can be obtained as: The accumulated EBVs are then used as the latent response variable in the GWAS-EBV-P model.
and GWAS-EBV-NP model Here, y is defined as the n × 1 vector of EBVs for all individuals; μ is the population mean; u is the vector of polygenetic effects with multivariate normal distribution MVN(0, A σ a 2 ), where A is the numerator relationship matrix and σ a 2 is the additive genetic variance; e is the vector of random residuals with a multivariate normal distribution MVN(0, I σ e 2 ); Z is the incidence matrix for polygenetic effects; a is the regression coefficient of EBVs on SNP genotypes and W is the vector of SNP genotypes coded as 0, 1, and 2.

GWAS-DRP-P and GWAS-DRP-NP models.
We used DRPs instead of deduced EBVs as potential response variable in Equations 12 and 13, and the respective models were called GWAS-DRP-P and GWAS-DRP-NP. DRPs were derived from EBVs using the method proposed by Garrick et al. which allowed for the removal of the parental average information from EBVs 23 .
GWAS-Residual model. The GWAS-Residual model uses the adjusted estimated residuals as the response variable in the GWAS analysis. This model is similar to the genomewide rapid association using mixed model and regression (GRAMMAR) model 33 , which obtains residuals of all individuals adjusted for polygenetic effects and subsequently analyzes the association between these residuals and each SNP covariate using rapid least-squares methods.
In our study, the estimated residuals for multiple records corresponding to each individual were obtained by solving equation (4). Then, we averaged the estimated residuals of multiple records for each individual as the adjusted estimated residual, which was employed as the response variable for association analysis with the model similar to GWAS-EBV-NP or GWAS-DRP-NP.
Scientific RepoRts | 7: 590 | DOI:10.1038/s41598-017-00638-2 Hypothesis tests. For each SNP, the incremental Wald statistic implemented by ASReml 48 was used to examine whether the SNP is associated with the trait. The Wald chi-squared statistic with a degree of freedom of df w is given by: Here, − R R [ (full model) (reduced model)] denotes the difference between the reduction in the sums of squares 49 (RSS) or models with and without SNP effect. The symbol df w is degree of freedom for the SNP effect.
For fGWAS-C model, df w = nf + 1, and for fGWAS-F model, df w = 2(nf + 1), where nf is the order of basis functions for the time-varied SNP effect as defined above. For other models defined above, df w = 1. The symbol σ e 2 is residual variance estimated via residual maximum likelihood (REML) method.
Simulations. We performed extensive simulations to systematically compare the performance of two random regression-based GWAS models (fGWAS-C and fGWAS-F) proposed here and other multiple-step traditional linear mixed models (GWAS-EBV-P, GWAS-EBV-NP, GWAS-DRP-P, GWAS-DRP-NP and GWAS-Residual) aforementioned. We evaluated statistical power, type-I error rate as well as the accuracy of SNP effect estimated for each GWAS method through 1,000 replication.
Population and genomic data were simulated with QMSim software 50 . The simulation started with a base population of 50 males and 50 females in generation −1,000, followed by 1,000 discrete historical generations (generation −1,000 to −1) with the same population size and an equal sex ratio. After 1,000 historical generations, the recent population was generated from generation −1 to generation 0 with population size expanded from 100 to 1,000 (500 males and 500 females). In the follow-up four recent generations (generation 1 to 4), 50 males were randomly selected from the last generation each mating with 500 females. Each female produced two offspring (one male and one female) at each recent generation. Finally, a total of 2,000 females from the last four recent generations were collected as the experimental population investigated. We determined 1,002 SNPs as the simulated genomic data, two of which were selected as independent target mutations. One contributed to the genetic variance (treated as the QTN) and the other had null effect on the longitudinal phenotype. These two SNPs were adopted to evaluate statistical powers and FPRs respectively across different models. The remaining 1000 SNPs were assigned the polygenic effects representing the genetic background of each individual, all genotypes of which were then removed in the final simulated data.
The longitudinal phenotype observations were simulated using self-developed C program. The detailed description was given in Supplementary Methods. In the simulation, heritability of simulated trait h 2 was set to 0.3 and heritability of functional QTN h QTN 2 (the proportion of phenotypic variance explained by the QTN) was set to different levels of 0.1%, 0.5%, 1% and 2%. The variances explained by the polygenetic and permanent environmental effect were scaled to achieve the preset heritability of the simulated trait.
In the simulation, the power and type-I error rate for each scenario were determined as the proportion of significant detections for functional QTN and null-effect SNP respectively among 1,000 replicates for each scenario.
Real data analysis. Two real datasets, a Chinese Holstein cattle data and the Genetic Analysis Workshop 18 (GAW18) data, were used to further validate performance of our proposed models. The detailed illustrations of the real data was provided in Supplementary Methods. For simplicity and conciseness, we merely employed the fGWAS-F model in the real data of longitudinal traits. Legendre polynomials 51 were used as the basis functions for the overall mean, additive genetic effect and permanent environmental effect. The orders of basis functions were evaluated based on the smallest AIC as well as BIC. As the effects contributed by most of SNPs were very small, we adopted the same variance components estimated by the reduced model of equation (1), which was then applied to the full GWAS model for testing each marker. This was similar to the strategy of Kang et al. 2 and Zhang et al. 3 . The Bonferroni correction was used to control false-positive rates for Chinese Holstein cattle data.