Powerful and Efficient Strategies for Genetic Association Testing of Symptom and Questionnaire Data in Psychiatric Genetic Studies

Genetic studies of psychiatric disorders often deal with phenotypes that are not directly measurable. Instead, researchers rely on multivariate symptom data from questionnaires and surveys like the PTSD Symptom Scale (PSS) and Beck Depression Inventory (BDI) to indirectly assess a latent phenotype of interest. Researchers subsequently collapse such multivariate questionnaire data into a univariate outcome to represent a surrogate for the latent phenotype. However, when a causal variant is only associated with a subset of collapsed symptoms, the effect will be challenging to detect using the univariate outcome. We describe a more powerful strategy for genetic association testing in this situation that jointly analyzes the original multivariate symptom data collectively using a statistical framework that compares similarity in multivariate symptom-scale data from questionnaires to similarity in common genetic variants across a gene. We use simulated data to demonstrate this strategy provides substantially increased power over standard approaches that collapse questionnaire data into a single surrogate outcome. We also illustrate our approach using GWAS data from the Grady Trauma Project and identify genes associated with BDI not identified using standard univariate techniques. The approach is computationally efficient, scales to genome-wide studies, and is applicable to correlated symptom data of arbitrary dimension.


Supplementary Methods
Assumptions and Notation: We assume an inventory, test, or questionnaire with Q questions. The response to each question q (q=1,…,Q) is an ordinal response ranging from 0 to F, where F is the maximum score possible. We assume a population-based sample of N subjects have responded to the questionnaire and possess common-variant data in a target gene or region. For subject j (j=1,…,N), we define Pj = (Pj,1, Pj,2, …, Pj,Q ) as subject j's responses to the Q questions. We then define a matrix of questionnaire responses for the entire sample , which is of dimension NxQ. Finally, for subject j, we define the traditional cumulative score typically used for genetic analysis of BDI or PSS as Similarly, we define Gj = (Gj,1, Gj,2, …, Gj,V ) to be the genotypes of subject j at V SNPs, where Gj,v is coded as the number of copies of the minor allele that the subject possesses at SNP v. The SNPs included in Gj will be referred to as the "SNP set." We then construct the matrix of genotypes for the sample as , which is of dimension N x V.
Several approaches to constructing a SNP set have previously been described 1,2 . For demonstration purposes in this manuscript, we will define a SNP set as common variants (minor-allele frequency [MAF] > 5%) that fall within 2kb of a gene of interest.
GAMuT uses a KDC framework to perform a SNP-set test to test for independence between P (NxQ matrix of multivariate responses to a questionnaire) and G (NxV matrix of multivariate genotypes). After standardizing P and G, we develop an NxN questionnairesimilarity matrix Y (based on P) and a NxN genotypic-similarity matrix X (based on G). The choice of how to model pairwise similarity or dissimilarity for a set of multivariate outcomes is quite flexible. For example, for P, we can model the matrix Y using a projection matrix, as suggested by Zapala and Schork 3 , such that . We can also construct the model Y using user-selected kernel functions 1,4-6 such as the linear kernel, For genotypes G, we model its corresponding matrix X using kernel functions x(Gi, Gj) that can take the same form (e.g., linear, quadratic, Gaussian, Euclidean distance) used to construct y(Pi, Pj), although additional kernels based on identity-by-state sharing are also possible. We may wish to further augment x(Gi, Gj) to preferentially upweight the contributions of particular SNPs in G over others in the gene. For simulations reported here, we implement a weighting scheme based on the minor-allele frequency (MAF) of each assayed SNP that weights rarer variants over more common ones as described in Kwee et al 4 . Another possible SNP weight could be a measure of the strength of association between the SNP and some related mental-health phenotype (e.g. major depressive disorder, schizophrenia) that is available from an independent public dataset like those provided by the Psychiatric Genomics Consortium 7-9 . Using such independent external weights likely has value since it could be argued that a variant associated with a psychiatric phenotype (e.g. MDD) in one dataset is more likely to be associated with a correlated psychiatric phenotype measured by PSS or BDI in an independent dataset given existing knowledge about the shared genetic overlap among such traits 10,11 . We can construct such a SNP weight as a function of the log odds ratio of the SNP in the independent dataset. Once we determine the weight function, we then create a diagonal weight matrix W= diag(w1, w2, …, wV ), where wv, reflects the relative (normalized) weight for the v th variant in the gene. Once we construct the similarity matrices Y and X, we derive our GAMuT approach as a test of independence between the elements of these two matrices. Briefly, we center each matrix as Yc=HYH and Xc=HXH. Here, is a centering matrix with property HH=H, I is an identity matrix of dimension N, and 1N is an Nx1 vector with each element equal to 1. Using Yc and Xc , we construct our test of independence of the two matrices as Under the null hypothesis of independence of the two matrices, TGAMuT follows the same asymptotic distribution as where λ X ,i is the i th ordered eigenvalue of Xc , λ Y , j is the j th ordered eigenvalue of Yc , and ! ! z 2 ij are independent and identically-distributed χ 1 2 variables 12 . We derive P-values for our GAMuT test analytically using Davies' exact method 13 , which is a computationally efficient method to provide accurate P-values in the extreme tails of tests that follow mixtures of chi-square variables 6 . An implementation of Davies' method is available in the R library

CompQuadForm.
Adjusting for Covariates: Genetic association tests must adjust for important covariates, such as principal components of ancestry, to avoid potential confounding of results. We can control for confounders before applying GAMuT by regressing each symptom scale separately on covariates of interest and then using the residuals to form the phenotypic similarity matrix Y. Although residualizing categorical phenotypes is not standard, studies have suggested that this procedure does not affect the validity of genetic association tests in case-control studies 14,15 . As we describe in the Results section, such residualization provides an effective correction for confounders within our simulated ordinal datasets.
Simulations: We conducted simulations to verify that GAMuT properly preserves type-I error (i.e., empirical size) and to assess power of GAMuT relative to standard association tests that treat questionnaire responses as a univariate outcome variable resulting from summing the responses into a continuous score. We perform simulations based on SNPs and LD patterns located within 2 kb up-and down-stream from two genes: self-administered or self-reported, and is scored by summing the ratings given to each of the 21 items. Summing the responses yields a score ranging from 0-63, with scores higher than 28 being indicative of moderate to severe depression.
To simulate BDI data, we first generated 21 outcomes for each subject using a multivariate normal distribution with mean vector 0 and QxQ correlation matrix Σ. We calculated Σ based on observed Spearman rank correlation calculations from the GTP BDI questionnaire responses shown in Supplementary Fig. S3. The observed correlations between questions ranged from 0.22 to 0.57. Next, we generated ordinal responses from the normally distributed variables to match the ordinal responses observed in GTP data.
Frequency of scores by each of the 21 BDI questions is shown in Supplementary Fig. S4. We found the percent of GTP participants who answered 0, 1, 2, and 3 for each question. We then matched the percentages of each BDI question for each of the 21 normally distributed variables. For example, in BDI Question 1 ("Sadness"), 56% of participants answered 0 ("I do not feel sad"), 34% answered 1 ("I feel sad much of the time"), 6% answered 2 ("I feel sad all of the time"), and 4% answered 3 ("I am so sad or unhappy that I can't stand it."). To simulate ordinal responses to Question 1, the lowest 56% of the continuous outcomes were assigned a score of 0, values falling in the 57-90 percentile were assigned a score of 1, 91-96 percentiles were assigned a score of 2, and values in the 97 th percentile and above were assigned a score of 3. We set sample size N of either 1,000 or 2,500 subjects. We applied GAMuT to 10,000 null simulated datasets to estimate empirical size.

The correlation between questions q and q' is defined as
is the LxL residual correlation matrix shown in Supplementary Fig. S3. This allows the residual correlation structure among phenotypes to stay at the defined values.
Using the simulated data, we evaluated GAMuT using either projection matrices or linear kernels to model phenotypic similarity and using weighted linear kernels to model genotypic similarity (with weights based on sample MAF). We compare GAMuT to two standard approaches that use the univariate cumulative questionnaire score for inference.
First, we consider a standard linear regression model that follows the form    Supplementary Figure S5: The QQ plots of 10,000 simulated null datasets assuming a sample size of 1,000 with a confounding variable. Questionnaire responses are independent of genotypes (for SNPs in STAT3), but both responses and genotypes are associated with a continuous covariate. Left shows QQ plots without adjustment for confounding, while right shows QQ plots after adjustment for confounding by residualization.