Prediction of lithium response using genomic data

Predicting lithium response prior to treatment could both expedite therapy and avoid unnecessary exposure to side effects. Since lithium responsiveness may be heritable, its predictability from genomic data is of interest. Using the largest existing genomic dataset in the lithium response literature (n = 2210 across 14 international sites; 29% responders), we evaluated the degree to which lithium response could be predicted from 47,465 genotyped single nucleotide polymorphisms using a supervised machine learning (ML) approach. Under appropriate cross-validation procedures, lithium response could be predicted at above-chance levels in two constituent sites (Halifax, Cohen's kappa 0.15, 95% confidence interval, CI [0.07, 0.24]; and Würzburg, kappa 0.2 [0.1, 0.3]). Variants with shared importance in these models showed over-representation of postsynaptic-membrane-related genes. Lithium response was not predictable in the pooled dataset (kappa 0.02 [−0.01, 0.04]), although non-trivial performance was achieved within a restricted dataset including only those patients followed prospectively (kappa 0.09 [0.04, 0.14]). Genomic classification of lithium response remains a promising but difficult task. Classification performance could potentially be improved by further harmonization of data collection procedures.


CONTENTS
Additional Details on Data Collection
Simulated p-Values of Null Classifiers
Additional Analysis Results
    Aggregate and Site-Level Analysis (Table S1)
    Leave-One-Site-Out Analysis (Table S2)
    Predict-One-Site-Out Analysis (Table S3)
    Analysis under 100-Fold Shuffle-Split Partitioning
    Gene Set Analyses in the Halifax and Würzburg Samples (Tables S4–S9)
    Classification Analysis with Progressively Narrowed Lithium Response Definitions

Simulated p-Values of Null Classifiers
We computed p-values for the classification performance measures by simulating the performance of a null classifier. Here, we describe this procedure formally; it is implemented in Python (3.5+) code available at https://github.com/abrahamnunes/criticism.
We are given a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ with $d$-dimensional feature vectors $\mathbf{x}_i \in \mathbb{R}^d$ and binary class labels $y_i \in \{0, 1\}$ for subjects indexed by $i \in \{1, 2, \ldots, n\}$. During model criticism, the dataset is split into disjoint training and validation partitions $\mathcal{D}_T^{(k)}$ and $\mathcal{D}_V^{(k)}$, where the subscripts T and V denote the training and validation sets, respectively, and the cross-validation folds are indexed by $k \in \{1, 2, \ldots, K\}$. Importantly, the cross-validation partitions are constructed in a stratified fashion, such that the probability of $y = 1$ (in our study, the probability of lithium response) is approximately equal in the training and validation sets.
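As an illustrative sketch only (synthetic data; the study's site structure and classifier training are omitted), stratified partitioning of this kind can be produced with scikit-learn's StratifiedKFold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))        # toy feature matrix (n subjects, d features)
y = rng.binomial(1, 0.29, size=n)  # toy binary labels (~29% "responders")

# Stratified folds keep the responder proportion approximately equal
# in the training and validation partitions of every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for k, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    print(f"fold {k}: train p={y[train_idx].mean():.2f}, "
          f"val p={y[val_idx].mean():.2f}")
```

Each fold's training and validation class proportions should agree closely, which is the property the null-classifier construction below relies upon.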
Under stratified cross-validation, we can compute the baseline probability of the positive class as $$p = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[y_i = 1],$$ where $\mathbb{1}[\cdot]$ is an indicator function that evaluates to 1 if its argument is true. Now let $\alpha$ represent the classifier's sensitivity, and $\beta$ its specificity. Given only $p$ and the number of subjects $n$, we can specify prior distributions on $\alpha$ and $\beta$ of a "null classifier" as follows: $$\alpha \sim \mathrm{Beta}\big(np, \; n(1-p)\big), \qquad \beta \sim \mathrm{Beta}\big(n(1-p), \; np\big).$$ We note that a classifier with sensitivity and specificity sampled from these distributions is trivial if we recall that the parameters of a beta distribution are pseudocounts. Since the first argument in each of the $\mathrm{Beta}(\cdot, \cdot)$ functions is the pseudocount for the positive class, we can see that, given a sample of size $n$, the expected number of times that a null classifier will identify a subject as being a lithium responder when he or she truly is one will be $np$, and the expected number of times that the null classifier will identify a lithium responder as a non-responder is $n(1-p)$. We can interpret the parameters of the prior on $\beta$ in the same way, except reversed in order.
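A minimal sketch of sampling from these priors follows (function and variable names are ours for illustration; the canonical implementation is the criticism package linked above):

```python
import numpy as np

def sample_null_operating_points(y, n_samples=10_000, seed=0):
    """Sample (sensitivity, specificity) pairs for a null classifier.

    Uses the pseudocount parameterization described in the text:
        sensitivity ~ Beta(n*p, n*(1-p))
        specificity ~ Beta(n*(1-p), n*p)
    where p is the base rate of the positive class.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = y.size
    p = y.mean()  # baseline probability of the positive class
    sensitivity = rng.beta(n * p, n * (1 - p), size=n_samples)
    specificity = rng.beta(n * (1 - p), n * p, size=n_samples)
    return sensitivity, specificity
```

On average, such a classifier labels a fraction $p$ of subjects as responders regardless of their features: its expected sensitivity is $p$ and its expected specificity is $1 - p$.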
Given $\alpha$ and $\beta$, we can compute the null classifier's expected confusion matrix for the sample of size $n$. If $y$ denotes the true label and $\hat{y}$ denotes the predicted label, we have the following confusion matrix:

$$\begin{array}{c|cc} & \hat{y} = 1 & \hat{y} = 0 \\ \hline y = 1 & np\alpha & np(1-\alpha) \\ y = 0 & n(1-p)(1-\beta) & n(1-p)\beta \end{array}$$

from which we can compute the Cohen's kappa metric as follows:

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$

where the observed agreement is $p_o = p\alpha + (1-p)\beta$ and the chance agreement is $p_e = pq + (1-p)(1-q)$, with $q = p\alpha + (1-p)(1-\beta)$ denoting the proportion of subjects predicted to be responders.

Let the null classification performance statistic be denoted as $\tilde{\kappa}$, and let $\hat{\kappa}$ denote the performance statistic computed for the actual model predictions $\hat{\mathbf{y}}$ and $\hat{\boldsymbol{\pi}}$ (where the latter represent the predicted probabilities of the positive class by the trained classifier). The simulated probability that the null classifier obtains performance greater than $\hat{\kappa}$ under the given performance metric is $P(\tilde{\kappa} \geq \hat{\kappa})$. Given $S$ simulations of the null classifier's performance, we can approximate this tail probability as $$P(\tilde{\kappa} \geq \hat{\kappa}) \approx \frac{1}{S} \sum_{s=1}^{S} \mathbb{1}\big[\tilde{\kappa}^{(s)} \geq \hat{\kappa}\big],$$ where $\tilde{\kappa}^{(s)}$ is the $s$'th sample of the null classifier's performance under a binary response set of size $n$.
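Putting the pieces together, a minimal sketch of the simulated p-value could look like the following (illustrative function names; the criticism package's actual API may differ):

```python
import numpy as np

def null_kappa_samples(y, n_samples=10_000, seed=0):
    """Simulate Cohen's kappa values achieved by a null classifier."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n, p = y.size, y.mean()
    sens = rng.beta(n * p, n * (1 - p), size=n_samples)   # sensitivity samples
    spec = rng.beta(n * (1 - p), n * p, size=n_samples)   # specificity samples
    p_o = p * sens + (1 - p) * spec            # observed agreement
    q = p * sens + (1 - p) * (1 - spec)        # proportion predicted positive
    p_e = p * q + (1 - p) * (1 - q)            # chance agreement
    return (p_o - p_e) / (1 - p_e)

def simulated_p_value(kappa_hat, y, n_samples=10_000, seed=0):
    """Approximate P(null kappa >= observed kappa) by Monte Carlo."""
    return float(np.mean(null_kappa_samples(y, n_samples, seed) >= kappa_hat))
```

Because the null classifier's sensitivity and specificity hover around $p$ and $1 - p$, its kappa samples concentrate near zero; a genuinely informative classifier's kappa therefore lands in the upper tail and receives a small simulated p-value.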
Default parameters were used in all analyses involving the criticism package.

Aggregate and Site-Level Analysis
Table S1: The performance of the XGBoost (XGB) classifier in the aggregate and site-level analyses. Table columns represent different classification statistics, and values represent the mean of each statistic over five folds along with an empirical 95% confidence interval. Asterisks signify that the given metric had a p-value less than 0.01 in comparison to the simulated null classifier. Abbreviations: all sites (ALL; i.e., aggregate analysis), area under the receiver operating characteristic curve (AUC), positive predictive value (PPV), negative predictive value (NPV), F1 score (F1).

[Table S1 values not reproduced here. Columns: Centre, AUC, Accuracy, F1, NPV, PPV, Sensitivity, Specificity, Kappa. a: Kappa value was found to have a p-value less than 0.01 in comparison to the null classifier.]

Leave-One-Site-Out Analysis

Table S2: The results of the XGBoost classifier in the leave-one-site-out analyses. Table columns represent different classification statistics, and values represent the mean of each statistic over five folds along with an empirical 95% confidence interval. The centre column shows the centre that was left out for each row. Asterisks signify that the given metric had a p-value less than 0.01 in comparison to the simulated null classifier. Abbreviations: area under the receiver operating characteristic curve (AUC), positive predictive value (PPV), negative predictive value (NPV), F1 score (F1).

Classification Analysis with Progressively Narrowed Lithium Response Definitions

The methods of this experiment are reported in the main body of the text. Results of the aggregate and site-level classification analyses across the four Alda score thresholding conditions are reported in Table S10.

Table S10: Results for the aggregate and site-level classification analyses using four different Alda score thresholds to separate responders and non-responders. Columns represent the different thresholds, and values represent the mean value of the given statistic across five folds with a 95% empirical confidence interval. Asterisks in the rows that represent the Kappa statistic signify a p-value less than 0.01 in comparison to the simulated null classifier.

Classification Analysis Stratified by Follow-Up Methodology
The methodology for this analysis is described in the main text. Briefly, we split the dataset into two subsets: (A) one including only those sites that followed subjects prospectively, and (B) one including the sites that did not. The aggregate, leave-one-site-out, and predict-one-site-out analyses were repeated within each of these conditions. Our primary outcome metric was Cohen's kappa, whose mean and 95% confidence intervals across folds were compared between models trained on the prospective and retrospective datasets in the aggregate analysis. Since the leave-one-site-out analyses do not have direct comparators across the prospective and retrospective datasets, we estimated the "statistical significance" of the Cohen's kappa results using the p-value simulation procedure described above.
Classification Results

Table S11: This table displays [caption incomplete in source].

Principal Component Analysis

Figure S1 shows an ensemble of plots detailing the population structure, along with the distributions of responders and non-responders in the training and validation sets in each fold of the aggregate analysis. The y-axis in each subplot represents the first principal component (PC), which the three columns plot against PCs two, three, and four, respectively. Each plot in the first row has points colored by data collection site, with circular points representing non-responders and crosses representing responders; this row clearly depicts the population structure that exists between sites. In each of the next five rows, each point is again shaped according to lithium responsiveness, but is colored according to whether it was in the training or validation set for the given fold. If population structure interfered with assessment of performance in the aggregate analysis, one would expect a systematic relationship between all three of (A) population structure, (B) lithium response, and (C) inclusion in the validation sets.
From Figure S1 we see that, despite the existence of population structure and variation in class proportions between sites, training and validation examples are distributed effectively at random across folds, which does not allow a classifier to consistently exploit population structure during cross-validation.

Figure S1: This figure shows the first principal component (PC) of the genotype matrix plotted against the 2nd, 3rd, and 4th PCs under various coloring schemes. In each subplot, the shape of a point indicates whether the sample is a lithium responder (cross) or non-responder (circle). In the first row, samples are colored according to the site from which they originated. In each of the next five rows, the color represents whether the sample was in the training or validation set for the given fold.