Genome-wide Analysis of Large-scale Longitudinal Outcomes using Penalization —GALLOP algorithm

Genome-wide association studies (GWAS) with longitudinal phenotypes provide opportunities to identify genetic variations associated with changes in human traits over time. Mixed models are used to correct for the correlated nature of longitudinal data. GWA studies are notorious for their computational challenges, which are considerable when mixed models for thousands of individuals are fitted to millions of SNPs. We present a new algorithm that speeds up a genome-wide analysis of longitudinal data by several orders of magnitude. It solves the equivalent penalized least squares problem efficiently, computing variances in an initial step. Factorizations and transformations are used to avoid inversion of large matrices. Because the system of equations is bordered, we can re-use components, which can be precomputed for the mixed model without a SNP. Two SNP effects (main and its interaction with time) are obtained. Our method completes the analysis a thousand times faster than the R package lme4, providing an almost identical solution for the coefficients and p-values. We provide an R implementation of our algorithm.

per individual. This allows the use of our fast GWAS algorithm 8 to obtain an approximate p-value for the interaction between SNP and time.
Here, we present a new algorithm for Genome-wide Analysis of Large-scale Longitudinal Outcomes using Penalization (GALLOP) which swiftly computes coefficients and p-values for cross-sectional and longitudinal SNP effects. To arrive at an almost exact solution we exploit several properties of the model. The effect of a SNP generally is (very) small. We estimate the variances in the mixed model without any SNP and assume that they will not change when a SNP is added. This assumption will lead to conservative p-values in case of non-zero SNP-effects. The magnitude of this imprecision is explored in the Results section. Using the equivalence between a mixed model and penalized least squares, a large system of linear equations can be set up. This system is very sparse (it contains many zeros) and only the last rows and columns change from SNP to SNP. With careful organization of the computations a solution is obtained very quickly. No special programming tricks are needed, our program (about 85 lines) is written in pure R and achieves a speed-up by three orders of magnitude, compared to brute-force application of lme4. Thanks to the sparseness of the equations, memory use is modest.
Quick access to SNP data is crucial and we also discuss it. An R implementation of GALLOP algorithm is provided. Simulated and real data are used to illustrate performance.

Results
Two characteristics of our method are of main interest: high speed and accuracy as compared to lmer function in the R package lme4. We assessed them via a simulation study and using real data.
( ) D 1 02 0 2 1 and measurement error σ = 2.5 • SNPs drawn from a uniform distribution between 0 and 2 Data sets used to evaluate computation times were generated in a similar manner with sample sizes varying from 1k to 10k with increment of 1k. For each sample size 1000 SNPs were analyzed to summarize computation time.
Results of the simulation study assessing computation times are shown in Fig. 1. The speed-up is linear in the number of individuals. For a genome-wide association study with 5000 individuals our methods finishes the analysis a thousand times faster. A genome-wide scan for 1 million SNPs of a phenotype, collected on 5000 individuals measured on 4 occasions, takes about 30 minutes, instead of 3 weeks when using the package lme4. The speed-up depends also on k. This is mainly attributed to the fact that lme4 requires expansion of the SNP vector.
Results of the simulations exploring accuracy are summarized graphically. Based on our theoretical derivations described in the Methods section we know that a non-zero main SNP effect affects the approximation of the variance of the random intercept. Similarly, the size of the interaction influences the variance of the random slope. On the other hand, genome-wide association studies typically show only very small SNP effects which barely contribute to the improvement of the goodness of fit. We ran simulations to explore the practical dangers and consequences of using the approximate variances. Despite the difficulties in defining the variance explained in mixed models we used a simple definition quantifying predictive power as the ratio = − || − || || − || R y y y y 1 / 2 2 2 , where ŷ stands for the fitted values and y for the average of y. The estimates are very accurate throughout the entire range of observed values (Fig. 2). The standard errors are somewhat overestimated for the larger values of β, which is expected as variances of random effects are inflated due to omitted SNP effects. However, the main interest in GWAS always lies in p-values (Fig. 3). These are almost exact (and never too optimistic) in the common GWA-range (0 < −log 10 (p) < 7). That eliminates the danger of finding too many false positive results. Due to overestimated standard errors, the −log 10(p) for larger betas are too pessimistic. Nevertheless, they increase monotonically with larger effect sizes, just with bias downward with respect to the −log 10 (p) from lmer. This loss of power can be solved by lowering the threshold for "GWAS significance" and repeating the analysis for promising SNPs with the correct model. In our simulation study, to find all SNPs for which −log 10 (p lmer ) > 7.3 we had to use the threshold −log 10 (p GALLOP ) > 7.05. In our simulation study the maximum contribution to R 2 of the SNP effects around 6%.
To confirm the accuracy of GALLOP on real data, we used the BMD data from the Rotterdam Study 9 . Details on the longitudinal BMD data set are provided in ref. 7 . For this analysis we used SNP data imputed according to the 1000 Genomes Project, which were stored per (part of) a chromosome as DatABEL files. To test our algorithm we used one of the files, which contained 97384 SNPs. We performed the association analyses with three methods: GALLOP, CTS, and lmer (only for 20 K SNPs). Comparison between p-values is shown in Fig. 4. CTS approach gives a good approximation of the p-values for longitudinal SNP effect, which coincide with our previous results on the real and simulated data. However, p-values from GALLOP are basically exact for main and longitudinal effect, irrespective of minor allele frequency. The analysis took 3.5 minutes for GALLOP, 40 seconds for CTS and 48 hours (extrapolated time based on the 20 K SNPs) for lmer, respectively.

Discussion
We presented a new algorithm for fast genome-wide analysis with longitudinal data. Our method runs a thousand times faster than lme4, which is the fastest option in R. This speed-up is achieved by combining an accurate approximation with a careful implementation. We showed that our method provides practically exact results. In case of doubt one can always do a full mixed model analysis for each of the most significant SNPs. Generally this is a small number; in case of BMD data 6 genotypes for any MAF reached threshold of −log 10 (p) > 7; so the extra computation time is negligible.  Our previous approach, conditional two-step (CTS) method combined with semi-parallel regression, computes p-values for the interaction effect about 15 times faster than the GALLOP. However, for CTS, SNP data access is still a bottleneck, 85% of the analysis time is spent on data access (Fig. 5). The genome-wide analysis of the BMD data was completed 5 times faster with CTS than with GALLOP. In case of very massive genome-wide analysis one could consider running CTS to filter out the least significant SNPs and proceed with GALLOP for more precise results.
GALLOP converts a genome-wide analysis with a longitudinal phenotype from a taxing multi-computer task to a job that can be run overnight on a single everyday computer. However, this is only true if access to the SNP data is fast enough. The memory limit in R depends on available RAM, but will usually not be larger than several gigabytes. The size of SNP data, even when split per chromosome, will exceed that size. GALLOP needs quick access to reasonably sized data blocks with multiple SNPs for all individuals. This is possible only when array-oriented binary files are used to store genotypes. We discussed this problem in detail, and proposed solutions in our previous work on fast analyses of cross-sectional outcomes 8 .
For correcting population stratification, in cross-sectional GWAS with possibly related individuals, mixed models are well established. Several algorithms have been proposed and implemented performing fast mixed model analysis in this framework. Multiple publications have proposed that this type of mixed models can be tweaked to analyze longitudinal data. Indeed, one may pretend that the repeated outcomes come from different pseudo-individuals and induce the correlation by passing the kinship matrix to the software. A quite extensive discussion on that topic is found in 10 . The author concludes that "the proper" longitudinal data analysis is to be preferred, but that it is too slow. Similarly, in ref. 11 . the authors analyzed longitudinal blood pressure data using EMMA, which tackles cross-sectional outcomes for related individuals. The authors tricked the software by mimicking an autoregressive structure in the kinship matrix. Although both papers study longitudinal data, their results touch only upon the main SNP effect. The interaction between SNP and time is not discussed.
Our algorithm assumes that the individuals are independent. An important extension is to adjust it for longitudinal data collected on related individuals, combining two types of mixed models. One approach to population stratification uses principal components of correlation matrix of the genotypes as covariates. They can be introduced as fixed effects in our model. The overhead is relatively small, because a large mixed model, without SNPs, is fitted once and each SNP is handled as a perturbation as described in the algorithm section.
The preferred approach would be to use multilevel modelling. Two sources of correlation then have to be combined: the temporal correlation between the repeated outcomes and the genetic correlation between the individuals. It would generate an additional random intercept, derived from the kinship matrix, which would destroy

Methods
A linear mixed model for a longitudinal outcome which assumes random intercepts and slopes has the following hierarchical form 12 : In (1) Y i is k i dimensional vector of responses for individual i, X i is k i × p matrix with all predictors, Z i is k i × 2 dimensional matrix with ones in the first column and t i in the second column, β is a p-dimensional vector of coefficients identical for all individuals and b i is a 2-dimensional vector containing the random effects. Measurement error is represented by the k i -dimensional vector ε i . Furthermore, D is the 2 × 2 variance-covariance matrix of random effects and Σ i is k i × k i the variance-covariance matrix of measurement error. Typically, the unknown parameters, consisting of variances, fixed and random effects, are estimated using for example Newton-Raphson algorithm. However, if the variances are known, the fixed and random effects can be estimated simultaneously by solving a penalized least squares problem given by equations: In (2) matrices X, Y, and b are build of X i ′s, Y i ′s, and b i ′s stacked underneath each other. Matrices Z and P are block diagonal with Z i and P i on the diagonals, where P i = (D/σ 2 ) −1 . System (2) is similar to Henderson's system of equations for mixed models.
A typical linear mixed model in a genome-wide association study will have a form: where t i is a k i -dimensional vector with measurement occasions, SNP i is a k i -dimensional vector with SNP values (constant over time), and C i is a k i × q-dimensional matrix with constant or time-varying additional covariates (such as height, weight, age etc.). We call model (3) a full model. Additionally, the reduced model is constructed from (3) omitting the SNP effects, as given in (4) The system of equations solving the penalized least squares problem for the reduced model have a special structure. We illustrate it for the case n= 3: i and the * distinguishes which components of the model are altered (with respect to length and/or values) due to misspecified model (4). The above system has a block structure as divided by the solid lines and can be written as with the explicit solution given by: The P * matrix can be easily obtained by fitting a mixed-effects model excluding SNP in any standard software (for example the R package lme4). The software does not explicitly return the P * , but it does return the variance-covariance matrix of the random effects (matrix D) and the variance of measurement error (σ 2 ). In R matrix P * is obtained by calling solve σ D ( / ) 2 . An additional computational simplification can be obtained by ensuring that A 22 in (7) is the identity matrix. This goal can be achieved as follows. Any system Ab = q can equivalently be solved by (KAK′)( and it is readily verified that The linearly transformed system becomes  Usually, the solution for the subject-specific effects is not of interest in GWA analyses. Nevertheless, random intercepts and slope from the reduced model can be easily obtained from the lme4 package. We add a SNP to the model, creating a border to the previous system of equations. Two effects, cross-sectional and longitudinal are added, so = * G t [ SNP SNP ] is a Σ i k i × 2 dimensional matrix, with SNP values repeated k i times for all individuals in the first column and the SNP values repeated k i times multiplied time occasions in the second column. Repeating SNP data for each individual k i times seems like a time consuming step. However, SNP is constant over time and thus G i = SNP i * Z i . In our implementation the vector replicating SNP vector k SCIENTIfIC REPORtS | (2018) 8:6815 | DOI:10.1038/s41598-018-24578-7 Implementation. GALLOP is implemented in one relatively short R program, provided in the Supplementary Materials. An important computing challenge was to avoid repeating each SNP value k times, to be able to calculate cross products in the border matrices. We achieved this by storing the basis of those matrices, calculated using Kronecker products, in a vector instead of a matrix. This way we can summarize the SNP state directly with two numbers per individual, regardless of k. Data availability. The BMD data are a part of Rotterdam Study and are confidential. The scripts used to generate the data in the simulation study are available from the corresponding author upon request.