Antidepressant drug-specific prediction of depression treatment outcomes from genetic and clinical variables

Individuals with depression differ substantially in their response to treatment with antidepressants. Specific predictors explain only a small proportion of these differences. To meaningfully predict who will respond to which antidepressant, it may be necessary to combine multiple biomarkers and clinical variables. Using statistical learning on common genetic variants and clinical information in a training sample of 280 individuals randomly allocated to 12-week treatment with the antidepressants escitalopram or nortriptyline, we derived models to predict remission with each antidepressant drug. We tested the reproducibility of each prediction in a validation set of 150 participants not used in model derivation. An elastic net logistic model based on eleven genetic and six clinical variables predicted remission with escitalopram in the validation dataset with an area under the curve of 0.77 (95% CI 0.66-0.88; p = 0.004), explaining approximately 30% of variance in who achieves remission. A model derived from 20 genetic variables predicted remission with nortriptyline in the validation dataset with an area under the curve of 0.77 (95% CI 0.65-0.90; p < 0.001), explaining approximately 36% of variance in who achieves remission. The predictive models were antidepressant drug-specific. Validated drug-specific predictions suggest that a relatively small number of genetic and clinical variables can help select treatment between escitalopram and nortriptyline.


Quality control and population structure
Quality control procedures were applied in PLINK (1), initially at the level of marker and then at the level of individual. Markers were retained if they had a minor allele frequency (MAF) of 0.01 or more as effects of rare markers would be uninterpretable with the present sample size. Markers were filtered for genotyping completeness of 99% so that all analyses were performed in a comparable set of individuals. Hardy-Weinberg Equilibrium (HWE) was not used as a filter as departures from HWE are expected in a case-only sample.
At the individual level, genotypes were first tested for sex mismatch with phenotypic data. Individuals with ambiguous genotypic sex and outliers on autosomal heterozygosity were investigated for exclusion, as these may indicate sample contamination. Related individuals were identified through estimation of identity by descent (IBD) applied in PLINK to an LD-pruned dataset, and one of each pair of first- or second-degree relatives (the one with less complete data) was excluded. Finally, genotyping completeness was assessed for each individual. The IMPUTE v2 program (2) was used to impute missing SNP data against the 1000 Genomes reference (build 37). Given the minimal percentage of missingness, any remaining missing value was completed following a best-guess approach. Quality-control measures ensured only the most accurately imputed SNPs were used (info score filter of 0.1 and a genotype probability threshold of 0). We specified a minor allele frequency (MAF) threshold of 0.005. Variants showing linkage disequilibrium (LD) over 0.8 were excluded from analysis. A total of 524,871 common genetic variants were analysed.
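The marker-level filters described above (minor allele frequency and genotyping completeness) can be illustrated with a minimal sketch. This is not the PLINK implementation used in the analysis; the function names, the 0/1/2 allele-count coding with `None` for a missing call, and the default thresholds (taken from the 0.01 MAF and 99% completeness values stated in the text) are assumptions made for illustration only.

```python
from typing import List, Optional

# Assumed coding: 0, 1 or 2 copies of the counted allele; None = missing call.
Genotype = Optional[int]


def call_rate(genotypes: List[Genotype]) -> float:
    """Fraction of individuals with a non-missing genotype at one marker."""
    if not genotypes:
        return 0.0
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)


def minor_allele_frequency(genotypes: List[Genotype]) -> float:
    """Frequency of the less common allele among non-missing genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    # Each individual carries two alleles at an autosomal marker.
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


def passes_marker_qc(
    genotypes: List[Genotype],
    min_maf: float = 0.01,
    min_call_rate: float = 0.99,
) -> bool:
    """Keep a marker only if it is both common enough and complete enough."""
    return (
        call_rate(genotypes) >= min_call_rate
        and minor_allele_frequency(genotypes) >= min_maf
    )
```

For example, a hypothetical marker with 90 homozygous reference and 10 heterozygous individuals has a MAF of 10 / 200 = 0.05 and would be retained, whereas a marker with 2% missing calls would be dropped regardless of its allele frequency.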
A genomic control lambda value was computed to assess false-positive evidence of association arising from genetic markers that differ in genotype frequencies between subpopulations of remitters and non-remitters. As the inflation factor lambda was 0.9794552 (less than 1), no adjustment was necessary (3). Estimation was performed using the GenABEL R package (4).
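As a rough illustration of the inflation factor, lambda is commonly estimated as the median of the observed 1-df chi-square association statistics divided by the median of the chi-square distribution with one degree of freedom (about 0.4549). The sketch below assumes the per-marker chi-square statistics are already available; it is not the GenABEL routine used in the analysis, and the function names are hypothetical.

```python
from statistics import median
from typing import List, Sequence

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_lambda(chi2_stats: Sequence[float]) -> float:
    """Median-based genomic inflation factor from 1-df chi-square statistics."""
    if not chi2_stats:
        raise ValueError("at least one test statistic is required")
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def adjust_for_inflation(chi2_stats: Sequence[float], lam: float) -> List[float]:
    """Divide statistics by lambda only when there is inflation (lambda > 1)."""
    if lam <= 1.0:
        # No inflation detected: statistics are left unchanged.
        return list(chi2_stats)
    return [x / lam for x in chi2_stats]
```

With a lambda below 1, as reported here, `adjust_for_inflation` returns the statistics unchanged, which mirrors the decision not to apply a correction.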

Methods
The aim of the analysis was to assess how well baseline demographic, clinical and genetic information, in combination, predicted whether individuals achieved remission in the whole sample of patients treated with either escitalopram or nortriptyline. The same variables and analysis steps used within each drug-specific group (fully detailed in the Methods section of the manuscript) were applied to the whole dataset. To provide a completely independent test of each prediction model, we randomly split the data into mutually exclusive training (65% of participants) and replication (the remaining 35%) datasets.
Sample sizes for the training and replication datasets were 280 and 150, respectively. The parameters for every model were estimated in the training dataset following a standard 5-fold cross-validation approach. The predictive ability of the resulting model was then tested in the independent replication dataset, which was not used in any way in the model derivation.
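The partitioning described above can be sketched as follows. This is only an illustration of a random, mutually exclusive split followed by fold assignment within the training set; the function names, the participant identifiers and the random seed are assumptions, since the source does not report how the randomisation was implemented.

```python
import random
from typing import Dict, Hashable, List, Sequence, Tuple


def split_train_replication(
    ids: Sequence[Hashable], n_train: int, seed: int = 0
) -> Tuple[List[Hashable], List[Hashable]]:
    """Randomly split participants into disjoint training and replication sets."""
    rng = random.Random(seed)  # the seed is an assumption, used for repeatability
    shuffled = list(ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def assign_folds(
    ids: Sequence[Hashable], k: int = 5, seed: int = 0
) -> Dict[Hashable, int]:
    """Assign each training participant to one of k cross-validation folds."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    # Dealing participants out in turn keeps fold sizes within one of each other.
    return {pid: i % k for i, pid in enumerate(shuffled)}


# Hypothetical identifiers 0..429 standing in for the 430 participants.
participants = list(range(430))
train_ids, replication_ids = split_train_replication(participants, n_train=280)
folds = assign_folds(train_ids, k=5)
```

Model tuning would use only `train_ids` and their fold labels; `replication_ids` would be touched once, for the final evaluation.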

Results
In the training dataset of participants treated with either escitalopram or nortriptyline, 12 variables were selected for the prediction of remission status. The selected predictors included the baseline total scores for HRSD, the symptom dimension of observed mood, loss of appetite and 9 genetic variants (Table S3, Table S4). The elastic net logistic regression model built from these 12 variables predicted remission in the replication dataset with an AUC of 0.69 (95% CI 0.61-0.76; p = 0.017), sensitivity of 0.68, specificity of 0.69 and pseudo-R2 of 0.17 (Table S4).
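For readers unfamiliar with the reported metrics, the following sketch shows one way AUC, sensitivity and specificity can be computed from predicted probabilities and observed remission status. It is not the code used to produce the results above; the function names, the 0/1 outcome coding and the 0.5 classification threshold are assumptions, as the source does not state the cut-off applied.

```python
from typing import Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a remitter scores higher than a non-remitter."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a concordant pair
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(
    labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5
) -> Tuple[float, float]:
    """Sensitivity and specificity at an assumed probability threshold."""
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both outcome classes must be present")
    return tp / (tp + fn), tn / (tn + fp)
```

The pairwise definition of AUC is quadratic in sample size, which is acceptable for a replication set of 150 participants; a rank-based formulation would be preferable for much larger samples.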

Discussion
The collagen gene COL25A1 has been implicated in Alzheimer's disease (1) and has been associated with comorbid antisocial personality disorder and substance dependence (2). c-Maf cooperates with Sox9 to activate the type II collagen gene (3). The TBC1D8 gene has been shown to be predictive of risk of postpartum depression (4), has been associated with osteoporosis (5) and has been identified as a predictor of pancreatic cancer (6). The immunomodulatory gene ITGB2 and its protein CD18 have been shown to be involved in pruning neuronal synapses during brain development, with ITGB2 knockout mice displaying deficits in synaptic connectivity along with several behavioural impairments (7)(8)(9)(10)(11). ITGB2 and its protein CD18 have also been associated with several conditions and processes: papillary thyroid cancer (12), inflammatory mechanisms (13), vasculitis (14), Hirschsprung's disease (15), repair of the infarcted myocardium (16), obesity (17) and alcohol response (18).