Exemplar scoring identifies genetically separable phenotypes of lithium responsive bipolar disorder

Predicting lithium response (LiR) in bipolar disorder (BD) may inform treatment planning, but phenotypic heterogeneity complicates discovery of genomic markers. We hypothesized that patients with “exemplary phenotypes”—those whose clinical features are reliably associated with LiR and non-response (LiNR)—are more genetically separable than those with less exemplary phenotypes. Using clinical data collected from people with BD (n = 1266 across 7 centers; 34.7% responders), we computed a “clinical exemplar score,” which measures the degree to which a subject’s clinical phenotype is reliably predictive of LiR/LiNR. For patients whose genotypes were available (n = 321), we evaluated whether a subgroup of responders/non-responders with the top 25% of clinical exemplar scores (the “best clinical exemplars”) were more accurately classified based on genetic data, compared to a subgroup with the lowest 25% of clinical exemplar scores (the “poor clinical exemplars”). On average, the best clinical exemplars of LiR had a later illness onset, completely episodic clinical course, absence of rapid cycling and psychosis, and few psychiatric comorbidities. The best clinical exemplars of LiR and LiNR were genetically separable with an area under the receiver operating characteristic curve of 0.88 (IQR [0.83, 0.98]), compared to 0.66 [0.61, 0.80] (p = 0.0032) among poor clinical exemplars. Variants in the Alzheimer’s amyloid–secretase pathway, along with G-protein-coupled receptor, muscarinic acetylcholine, and histamine H1R signaling pathways were informative predictors. This study must be replicated on larger samples and extended to predict response to other mood stabilizers.


The Clinical Exemplar Score
Let (x_ij, y_ij) ∈ X denote phenotypic data from subject i ∈ {1, 2, ..., n_j}, where x_ij is a vector of clinical features, y_ij ∈ {0, 1} denotes whether the patient is a lithium responder, and n_j is the number of patients in the sample from site j ∈ {1, 2, ..., S}. A pair (x, y) can thus be viewed as a set of coordinates on the (observable) phenotypic space X. Data are sampled from S sites, each of which can be considered to sample a subdomain of the phenotypic space, X^(j) ⊆ X. These site-wise subdomains are not necessarily disjoint. Indeed, if they were disjoint, the sites' data would share nothing in common.

Now let M_j denote a classifier learned on training data from site j. Given a new set of clinical features, x, the classifier predicts the probability that the corresponding patient is a lithium responder: that is, p̂_j = M_j(x). We denote the accuracy score of this prediction at a point (x, y) as

    a_j(x, y) = y p̂_j + (1 − y)(1 − p̂_j).    (1)

The representational Rényi heterogeneity measurement approach (1) consists of measuring heterogeneity on a latent or transformed space onto which observable data are mapped. Having defined our observable space X, we must now devise an appropriate transformed space upon which the Rényi heterogeneity will be both meaningful and tractable. The heterogeneity deemed relevant in the present case arises from differences in classification models across sites. Most starkly, we noted that the informative features for lithium response prediction varied between the best performing sites. In other words, depending on which site's data are used for training, one might learn quite different (and perhaps even contradictory) relationships between clinical features and lithium responsiveness. In the limit where each site's data encode completely different relationships between clinical features and lithium response, each classifier M_j will behave distinctly (the classifiers will tend to disagree).
In terms of numbers equivalent, we would say that in such a case there is an effective number of S distinct classifiers. Conversely, if the phenotypic domains of all sites overlap completely, then all classifiers M j will tend to make similar predictions, which would correspond to an effective number of one classifier.
Let the accuracy of classifier M_j in predicting the relationship x → y be a measure of that model's informativeness at the point (x, y). We can thus define T as a categorical space representing an index on "the most informative classifier." We illustrate the mapping f : X → T in Figure 1. A probability distribution over T can be computed by normalizing Equation 1:

    f_j(x, y) = a_j(x, y) / Σ_{k=1}^{S} a_k(x, y).    (2)

The quantity f_j(x, y) can be taken to represent the probability that a classifier trained on data from site j is the most informative about the x → y mapping in that particular region of X.

Figure 1: Representation of the mapping from phenotypic space X onto the representation of "most informative site-level model" (T). The transformation function is the normalized accuracy score for a classification model trained on each site's data individually (Equation 2).

With this, we can compute the representational Rényi heterogeneity at (x, y) as follows:

    Π_q(x, y) = ( Σ_{j=1}^{S} f_j^q(x, y) )^{1/(1−q)}.    (3)

If the models M_{j=1,2,...,S} differ only in their training data (i.e., they have the same architecture, optimization routine, and hyperparameters), then the units of Equation 3 are "the effective number of informative sites."

Recall that we defined a "clinical exemplar" as a subject whose phenotype (x, y) is reliably predicted accurately across all sites. In other words, regardless of the differences between sites' data, all sites would agree in their predictions of the exemplars' phenotypes. More formally, clinical exemplars must have high values of Π_q(x, y) (all sites are similarly informative). However, to identify the exemplars of lithium response and non-response specifically, we cannot rely on Π_q(x, y) alone, since that value may be high even when the sites' prediction accuracies are uniformly low.
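The "effective number of informative sites" interpretation can be sanity-checked numerically. The sketch below assumes the standard numbers-equivalent form of the Rényi heterogeneity, Π_q = (Σ_j f_j^q)^{1/(1−q)}, applied to the normalized accuracy vector of Equation 2; the function name is illustrative.

```python
import numpy as np

def renyi_heterogeneity(f, q=2.0):
    """Effective number of informative sites at one phenotypic point.

    f: probability vector over sites (Equation 2); q: resolution parameter.
    Computes Pi_q = (sum_j f_j^q)^(1/(1-q)), with the Shannon limit at q = 1.
    """
    f = np.asarray(f, dtype=float)
    if np.isclose(q, 1.0):
        # q -> 1 limit: exponential of the Shannon entropy
        nz = f[f > 0]
        return float(np.exp(-np.sum(nz * np.log(nz))))
    return float(np.sum(f ** q) ** (1.0 / (1.0 - q)))

# Seven equally informative sites -> effective number of 7
print(renyi_heterogeneity(np.ones(7) / 7))  # 7.0
# One dominant site -> effective number near 1
print(renyi_heterogeneity([0.94] + [0.01] * 6))
```

The two limiting cases match the prose above: complete overlap of the site-wise subdomains yields an effective number of S classifiers behaving as one is avoided only when the f_j are uniform (effective number S), while a single informative site drives the value toward one.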
Let t* = max_j a_j(x, y) denote the maximal accuracy score obtained in classification at (x, y). We take this value to represent the degree to which a subject with that phenotype can be clearly associated with one class or the other. An interesting case occurs where both t* and Π_q(x, y) are high, suggesting that the point (x, y) is an exemplar of the regions of X that are reliably well classified across sites. Conversely, if t* ≈ 0.5 and Π_q(x, y) is high, then that point is exemplary of a region of X about which all sites are uncertain. When t* is low and Π_q(x, y) is high, then (x, y) is exemplary of a region of X that reliably misleads all sites' classifiers.
In the present study, we are concerned with identifying only those subjects with high values of both t* and Π_q(x, y), since they exemplify the most canonical "phenotypes" of lithium response and non-response, respectively. We accomplish this by combining t* and Π_q(x, y) into a single index we call the exemplar score. The exemplar score at coordinate (x, y) of the phenotypic space is defined as

    E(x, y) = t* Π̃_q(x, y),    (4)

where Π̃_q(x, y) is a standardization of the Rényi heterogeneity to the [0, 1] interval (the same scale as t*):

    Π̃_q(x, y) = (Π_q(x, y) − 1) / (S − 1).    (5)

In the present study, we define the "best exemplars" as subjects whose exemplar scores (within their lithium response classes) were in the top 25%. Poor exemplars were those subjects whose phenotypes were in the lower quartile of exemplar scores within their response classes.
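The combination of t* and the standardized heterogeneity can be sketched as follows. This assumes the exemplar score is the product of the maximal accuracy score and the heterogeneity rescaled from its [1, S] range to [0, 1]; the function name and example values are illustrative.

```python
import numpy as np

def exemplar_score(a, q=2.0):
    """Combine per-site accuracy scores a_j(x, y) at one phenotypic
    point into an exemplar score (a sketch; the product form and the
    linear standardization of Pi_q are assumptions).

    a: vector of S accuracy scores, one per site-level model.
    """
    a = np.asarray(a, dtype=float)
    S = a.size
    f = a / a.sum()                                 # Equation 2
    pi_q = np.sum(f ** q) ** (1.0 / (1.0 - q))      # Equation 3 (q != 1)
    pi_std = (pi_q - 1.0) / (S - 1.0)               # rescale [1, S] -> [0, 1]
    t_star = a.max()
    return float(t_star * pi_std)

# All 7 sites classify the point well and agree -> score near t* = 0.9
print(exemplar_score([0.9] * 7))
# Only one site classifies the point well -> score shrinks sharply
print(exemplar_score([0.9] + [0.1] * 6))
```

Subjects would then be ranked on this score within their response class, with the top and bottom quartiles labelled best and poor exemplars, respectively.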

The Predict Every Subject Out (PESO) Protocol
The predict every subject out protocol (PESO; Figure 2) is a method by which we can compute exemplar scores for each subject in the dataset while (A) ensuring that the subject is not included in the training data and (B) having each model train only on its own site's data. All classifiers were random forest classifiers (RFCs) (2) under the same specifications as in Nunes et al. (3) (100 estimators; SciKit Learn implementation (4)). As in that study, missing data were marginalized by sampling from uninformative priors on the respective variables' domains (3).
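The marginalization step can be sketched as follows, assuming the uninformative prior is uniform over each variable's admissible values; the function name and argument layout are illustrative.

```python
import numpy as np

def uninformative_impute(X, domains, rng):
    """Replace each missing entry (NaN) with a draw from a uniform
    prior over that variable's domain.  `domains[j]` lists the
    admissible values of feature j (an assumption about the encoding;
    the published pipeline repeats this sampling over K iterations).
    """
    X = np.array(X, dtype=float)
    for j, dom in enumerate(domains):
        miss = np.isnan(X[:, j])
        X[miss, j] = rng.choice(dom, size=miss.sum())
    return X

rng = np.random.default_rng(0)
X = [[1.0, np.nan], [np.nan, 0.0]]
print(uninformative_impute(X, [[0, 1, 2], [0, 1]], rng))
```

Repeating the imputation and averaging downstream predictions approximates integrating the classifier's output over the missing values.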
For each site in the clinical predictors dataset, the PESO analysis protocol begins with a leave-one-out cross-validation run to obtain out-of-sample predictions for each of that site's constituent subjects. We then train an RFC on that site's complete data and predict lithium response in all other sites' subjects. Each subject is thus mapped onto our categorical space T, upon which we can measure their exemplar scores.
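The PESO loop can be sketched as below. The data layout (a dict of per-site feature matrices and label vectors), the accuracy-score formula a_j = y p̂_j + (1 − y)(1 − p̂_j), and the omission of the imputation and rebalancing steps are all simplifying assumptions; this is not the authors' full pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut

def peso_accuracy_scores(X_by_site, y_by_site, seed=0):
    """For each training site j, obtain out-of-sample response
    probabilities for every subject: leave-one-out within site j
    itself, plain train-on-j / predict-elsewhere across sites.
    Returns {(site, subject_index): vector of S accuracy scores}.
    """
    sites = sorted(X_by_site)
    prob = {s: {} for s in sites}   # prob[train_site][(site, i)] = p_hat
    for j in sites:
        Xj, yj = X_by_site[j], y_by_site[j]
        # (A) leave-one-out within the training site
        for tr, te in LeaveOneOut().split(Xj):
            m = RandomForestClassifier(n_estimators=100, random_state=seed)
            m.fit(Xj[tr], yj[tr])
            prob[j][(j, te[0])] = m.predict_proba(Xj[te])[0, 1]
        # (B) train on all of site j, predict all other sites' subjects
        m = RandomForestClassifier(n_estimators=100, random_state=seed)
        m.fit(Xj, yj)
        for k in sites:
            if k == j:
                continue
            for i, p_i in enumerate(m.predict_proba(X_by_site[k])[:, 1]):
                prob[j][(k, i)] = p_i
    # accuracy score a_j = y * p + (1 - y) * (1 - p) for every subject
    scores = {}
    for k in sites:
        for i, y in enumerate(y_by_site[k]):
            p = np.array([prob[j][(k, i)] for j in sites])
            scores[(k, i)] = y * p + (1 - y) * (1 - p)
    return scores
```

The returned accuracy-score vectors are exactly the inputs needed to compute f_j, Π_q, and the exemplar score for each subject.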

Gene Set Analysis
At each fold of cross-validation, the logistic regression coefficients were saved. SNPs whose logistic regression coefficients were of the same sign (i.e., consistently positive or consistently negative) across all folds were ranked by their absolute median coefficient values and linked to gene identifiers using the NCBI gene database. Each gene was assigned the maximal absolute value of the logistic regression coefficients over all SNPs tagged by that gene; duplicates were deleted, such that each included gene had only one numerical value associated with it. We then applied the statistical enrichment test in the PANTHER classification system v. 14.1 (5). We repeated the statistical enrichment test for the following annotation sets: PANTHER pathways, GO molecular function (complete), GO biological processes (complete), and GO cellular components (complete). To further evaluate the degree to which the enrichment analyses speak specifically to findings among the best exemplars, we repeated the same procedures using the logistic regression coefficients for the poor exemplars.
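The coefficient-aggregation step described above can be sketched as follows. The folds-by-SNPs table layout and the `snp_to_gene` mapping are assumed; the NCBI lookup and the PANTHER enrichment test themselves are not reproduced.

```python
import numpy as np
import pandas as pd

def rank_snps_and_collapse_to_genes(coefs, snp_to_gene):
    """coefs: (folds x SNPs) DataFrame of logistic regression
    coefficients saved at each cross-validation fold.
    snp_to_gene: dict mapping SNP IDs to gene symbols.
    """
    # keep only SNPs whose coefficient sign is consistent across folds
    signs = np.sign(coefs)
    consistent = coefs.columns[(signs.eq(signs.iloc[0])).all(axis=0)]
    # rank by absolute median coefficient
    med = coefs[consistent].median(axis=0).abs().sort_values(ascending=False)
    # collapse to genes: each gene keeps the max |coefficient| of its SNPs
    genes = med.rename(index=snp_to_gene)
    return genes.groupby(level=0).max()

coefs = pd.DataFrame({
    "rs1": [0.5, 0.4, 0.6],    # consistent positive
    "rs2": [0.1, -0.1, 0.2],   # sign flips across folds -> dropped
    "rs3": [-0.3, -0.2, -0.4], # consistent negative
    "rs4": [0.05, 0.06, 0.04],
})
print(rank_snps_and_collapse_to_genes(
    coefs, {"rs1": "GENE_A", "rs3": "GENE_A", "rs4": "GENE_B"}))
```

The gene names here are placeholders; in the study, the resulting gene-level values were the inputs to the PANTHER statistical enrichment test.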

Summary of Genomic Preprocessing Methods
Our raw dataset consisted of the genotypes resulting from the preprocessing and imputation steps taken by Hou et al. (6); we summarize their quality control and imputation steps here. Note, however, that the sample used for the present study includes only SNPs that were directly genotyped across all sites. Our subject sample is restricted to those from the Dalhousie University sample of ConLiGen, since these were the only subjects for whom clinical variables were also available. Hou et al. (6) applied the following quality control thresholds for retaining SNPs and subjects: subject-wise SNP-missingness rate less than 0.03; autosomal heterozygosity rate within ±3 standard deviations of the mean; minor allele frequency greater than 0.01; SNP-wise missingness rate less than 0.05; and SNP Hardy-Weinberg equilibrium p-values greater than 10^−4 in all samples. Hou et al. (6) detected no discrepancies between reported and genotypic sex.
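The missingness and allele-frequency thresholds above can be illustrated on a 0/1/2-coded genotype matrix. This is a minimal sketch, not the Hou et al. pipeline: the heterozygosity and Hardy-Weinberg checks are omitted, and the subjects-by-SNPs encoding with NaN for missing calls is an assumption.

```python
import numpy as np

def qc_filter(genotypes):
    """Boolean retention masks for subjects and SNPs, applying only the
    missingness and minor-allele-frequency thresholds from the text.
    genotypes: (subjects x SNPs) array coded 0/1/2, NaN = missing.
    """
    G = np.asarray(genotypes, dtype=float)
    # subjects: SNP-missingness rate < 0.03
    keep_subj = np.isnan(G).mean(axis=1) < 0.03
    G = G[keep_subj]
    # SNPs: missingness rate < 0.05
    keep_snp = np.isnan(G).mean(axis=0) < 0.05
    # SNPs: minor allele frequency > 0.01
    af = np.nanmean(G, axis=0) / 2.0
    maf = np.minimum(af, 1.0 - af)
    keep_snp &= maf > 0.01
    return keep_subj, keep_snp

G = np.array([[0, 1, 0],
              [0, 1, 1],
              [0, 2, 1],
              [np.nan, np.nan, 1]])  # last subject: 2/3 calls missing
print(qc_filter(G))  # drops the last subject and the monomorphic SNP
```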

Clinical Data Cohort Descriptions
A description of the clinical variables dataset is provided in Table 1.

Evaluation of Population Structure
To evaluate for the presence of population stratification in our genomic sample, we plot the first several principal components of the subjects' genotypes in Figure 3. For comparison, Figure 4 demonstrates the first several principal components from 14 sites of the full Consortium on Lithium Genetics (ConLiGen) genomic sample.

Supplementary Figure 4:
Principal components analysis of the genomic dataset from the Consortium on Lithium Genetics sample (6).

Figure 5 plots the site-level models' accuracy distributions. These distributions were highly variable in shape and modality, providing further confirmation that classification behaviour differed between site-level models. Several of the distributions shown in Figure 5 correspond clearly to the Brier scores reported in the evaluation of model calibration by Nunes et al. (3). For example, the Brier score for the Maritimes clinical dataset was 0.15 (95% CI 0.13-0.16), whereas for the Poznan site it was 0.24 (0.23-0.24) in the original study. Figure 5 indeed shows that the probabilistic predictions made by the Maritimes site are more widely distributed than those of Poznan, as one would expect of a better calibrated model. One can also appreciate the limitations inherent to IGSLi's inclusion of only lithium responders (which in Nunes et al. (3) prevented reporting of a site-level analysis for this sample): since IGSLi includes only lithium responders, its model achieves perfect accuracy for lithium responders but completely erroneous predictions for non-responders.

Demographic Comparisons of Best and Poor Exemplars among Genotyped Subjects
Clinical and demographic comparisons between the best exemplars, poor exemplars, and the aggregated sample of genotyped patients are presented in Table 2, with stratification by lithium response. The results of the gene enrichment analysis are presented in Table 4, with specific genes enriched in the best exemplar group (related to glutamate receptors and signalling processes) shown in Tables 5 and 6.

Supplementary Table 2:
Demographic comparisons for subjects whose genomic data (from the Consortium on Lithium Genetics; ConLiGen) overlapped with our clinical dataset. Comparisons were done between lithium responders (LR(+)) and non-responders (LR(-)) for the total group ("ALL"), the best exemplars ("Best"; exemplar score ≥ 75th percentile), and the poorest exemplars ("Poor"; exemplar score < 25th percentile).

Results of Genomic Classification of Lithium Response
Supplementary Table 3: Results of classifying lithium response based on the genomic data of all subjects (ALL; n=321), the poor exemplars (<25th percentile of exemplar score; n=81), and the best exemplars (>75th percentile of exemplar score; n=79). Each panel shows the results for a different classification performance metric. Classification was done using logistic regression with an L2 penalty (regularization weight set to C=1 a priori), with results stratified by the value of the resolution parameter (q=1 and q=2). Abbreviations: area under the receiver operating characteristic curve (AUC), Cohen's kappa (Kappa), Matthews correlation coefficient (MCC), positive predictive value (PPV), negative predictive value (NPV). Results are presented as means and 95% confidence intervals.

Preliminaries
Out-of-sample model criticism requires splitting a dataset into training and testing partitions, often repeatedly via cross-validation. Performance estimates computed on smaller test sets will have higher variance, which can be shown in closed form as follows. Let N_T be the size of the test set and N_C the number of examples correctly classified. The probability distribution over N_C is binomial with parameters N_T and 0 < θ < 1, where θ is the underlying accuracy of the model. Since the conjugate prior for a binomial likelihood is Beta(θ|α, β), with hyperparameters (pseudo-counts) α > 0 and β > 0, the posterior over θ is Beta(θ|α + N_C, β + N_T − N_C). The posterior variance is

    Var[θ] = (α + N_C)(β + N_T − N_C) / [(α + β + N_T)^2 (α + β + N_T + 1)].    (6)

Under a uniform prior, Beta(θ|α = 1, β = 1), the maximum likelihood estimate (MLE) of accuracy for a given test set is θ̂ = N_C/N_T, and the posterior variance can be rewritten as

    Var[θ] = (1 + N_T θ̂)(1 + N_T(1 − θ̂)) / [(N_T + 2)^2 (N_T + 3)].    (7)

It is trivial to show that Equation 7 is strictly non-increasing with respect to N_T and that its limit as N_T grows large is zero. It is therefore clear that with small N_T there is a greater probability of obtaining extreme accuracies, both high and low, as suggested and shown by (9)(10)(11). Given publication bias, one would consequently expect to see the phenomenon highlighted by Schnack & Kahn (12), whereby larger test set sizes are negatively associated with classification performance. This can be appreciated by visualizing the inverse survival function of the upper tail of Beta(θ|α + N_C, β + N_T − N_C) under a uniform prior and an MLE of θ̂_o = 0.5 = N_C/N_T, allowing us to substitute (α + N_C) = (β + N_T − N_C) = 1 + N_T/2. The inverse survival function of this beta distribution, denoted ψ_o(p, N_T), is the value of θ such that Prob(X > θ) = p for 0 < X < 1:

    ψ_o(p, N_T) = I^{-1}_{(1 + N_T/2, 1 + N_T/2)}(1 − p),

where I^{-1} is the inverse of the regularized incomplete beta function.
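The closed-form posterior variance can be checked against SciPy's Beta distribution. A minimal sketch, written in terms of the MLE θ̂ = N_C/N_T under the uniform Beta(1, 1) prior; the function name is illustrative.

```python
from scipy.stats import beta

def posterior_variance(theta_hat, n_t):
    """Posterior variance of accuracy under a uniform Beta(1, 1) prior,
    as a function of the MLE theta_hat = N_C / N_T (Equation 7)."""
    a = 1.0 + n_t * theta_hat            # alpha + N_C
    b = 1.0 + n_t * (1.0 - theta_hat)    # beta + N_T - N_C
    return a * b / ((a + b) ** 2 * (a + b + 1.0))

# Agrees with scipy's Beta variance, and shrinks as N_T grows
for n_t in (10, 100, 1000):
    a = b = 1 + n_t * 0.5
    print(n_t, posterior_variance(0.5, n_t), beta.var(a, b))
```

The printed pairs match term for term, and the variance visibly decreases toward zero with larger test sets, which is the point of the derivation.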
Figure 6 illustrates the effect of test-set sample size on the top 100p-th percentile of accuracy achieved by a "null" or "trivial" classifier whose true underlying accuracy is θ̂_o = 0.5.
Let θ̂(N_T) be the expected accuracy of a classifier applied to some data, evaluated out-of-sample with test-set size N_T. Our classifier is better than the null if Prob(X > θ̂(N_T)) is small, or equivalently if θ̂(N_T) > ψ_o(p, N_T) consistently with respect to N_T for some small value of p. For example, if θ̂(N_T = 10) > ψ_o(p = 0.05, N_T = 10), then our classifier performs better than 95% of expected null classifiers at a test-set size of 10. Similarly, if θ̂(N_T = 10) > ψ_o(p = 0.01, N_T = 10), then our classifier has passed an even more stringent test, with better performance than 99% of expected null classifiers.
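The threshold ψ_o(p, N_T) is directly available from SciPy's inverse survival function for the Beta distribution. A minimal sketch; the function name and example numbers are illustrative.

```python
from scipy.stats import beta

def psi_o(p, n_t):
    """Upper-tail accuracy threshold for a null classifier with true
    accuracy 0.5: the value of theta such that Prob(X > theta) = p
    under Beta(1 + N_T/2, 1 + N_T/2)."""
    a = 1.0 + n_t / 2.0
    return beta.isf(p, a, a)

# A mean out-of-sample accuracy of 0.8 on a 20-subject test set would
# clear this threshold, i.e. beat 95% of expected null classifiers.
print(psi_o(0.05, 20))
```

Note that ψ_o shrinks toward 0.5 as N_T grows, which is why a small test set makes spuriously high accuracies easy to obtain.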

Sensitivity to Test-Set Size
We repeated our genomic classification experiment for the aggregate sample (denoted "ALL"), as well as the best and poor clinical exemplar strata (those individuals with the top and bottom 25% of exemplar scores, respectively). However, rather than using 10-fold stratified cross-validation, as in the main text, we conducted 100 randomized train-test splits at each of the following test set proportions: p test ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8}. For example, at p test = 0.1, we hold out 10% of observations as a test set within the shuffle-split regime.
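The shuffle-split procedure can be sketched as follows. The use of plain (non-stratified) `ShuffleSplit`, the synthetic data, and the function name are assumptions; the classifier settings (L2-penalized logistic regression, C=1) follow the main text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit

def accuracy_by_test_size(X, y, p_tests=(0.1, 0.2, 0.3, 0.4, 0.5),
                          n_splits=100, seed=0):
    """Repeated randomized train-test splits at each held-out
    proportion.  Returns {p_test: (mean accuracy, test-set size N_T)}."""
    out = {}
    for p in p_tests:
        accs, n_test = [], 0
        splitter = ShuffleSplit(n_splits=n_splits, test_size=p,
                                random_state=seed)
        for tr, te in splitter.split(X):
            clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
            clf.fit(X[tr], y[tr])
            accs.append(accuracy_score(y[te], clf.predict(X[te])))
            n_test = len(te)
        out[p] = (float(np.mean(accs)), n_test)
    return out
```

Each entry's mean accuracy can then be compared against ψ_o(p, N_T) at the corresponding test-set size, as done in Figure 7.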
The classification performance attained at each of the 100 train-test splits conducted at each test-set size N_T (within each of the ALL, best clinical exemplar, and poor clinical exemplar strata) can be compared to the corresponding value of ψ_o(p, N_T). Figure 7a plots these comparisons across test-set sizes and strata for p ∈ {0.05, 0.025}. Figure 7b plots the proportion of shuffle-split runs for which classification performance exceeded ψ_o(p = 0.05, N_T) at each test-set size (expressed as a proportion of the stratum's sample). The mean classification accuracy for the poor exemplar stratum never exceeded ψ_o(p = 0.05, N_T). Genomic classification accuracy within the best clinical exemplars was consistently better than expected from a null model, with the mean classification accuracy exceeding ψ_o(p = 0.05, N_T) at all test-set sizes. Mean classification accuracy within the "ALL" stratum also exceeded ψ_o(p = 0.05, N_T) at test-set proportions p_test > 0.1.

Figure 8 shows the effects of increasing test-set size on classification performance statistics. The most important finding is that classification performance within the best clinical exemplar stratum was superior to classification within either the whole genomic sample or the poor clinical exemplars, which supports the central claim of our paper. The area under the receiver operating characteristic curve (AUC) remained on the order of ≥ 0.8 until the test set was increased to ≥ 50% of the total sample size of the best clinical exemplar stratum. Results of the pathway analysis are shown in Table 4. Genes enriched in the glutamatergic synapse cellular component are shown in Table 5, and genes enriched in the glutamatergic signalling biological process are shown in Table 6.

Further Rationale for SNP Set Used in Classification Analyses
Filter-based feature selection approaches would have been (A) too computationally expensive across these millions of variants and (B) demanding of much larger sample sizes, since such procedures must be repeated within each training partition. We also had no dominant a priori biological rationale for limiting the data to a restricted subset since, as our results later confirmed, the relevant biological systems may differ between exemplar strata. Ultimately, we chose the set of completely genotyped SNPs that overlapped across ConLiGen sites in order to facilitate the potential conceptual generalizability of our pathway analysis results in particular. That is, since the pathways detected were based on variants that are broadly genotyped, these results could potentially be extended to other ConLiGen sites, should the corresponding clinical variables become available.