Introduction

Current diagnostic approaches in psychiatry are based almost exclusively on phenomenological observations rather than biological verification. Unlike in other medical fields, objective biomarker measures are rarely used in psychiatry to support clinical decision-making. Growing evidence indicates that the existing diagnostic formulations show poor correspondence to emerging biomarker-based constructs1,2. The absence of biologically-informed disease entities—e.g. for psychotic disorders—or of actionable biomarkers within the global disease constructs hampers progress in understanding disease mechanisms and in treatment development.

Recent strategies using underlying neurobiology have challenged the diagnostic boundaries of psychoses. Instead of searching for biomarkers for ‘schizophrenia’ or ‘bipolar disorder’, data-driven approaches using broad biomarker panels have been applied to re-categorize psychosis cases into subgroups with more cohesive neurobiological profiles3,4. Using this approach, the Bipolar-Schizophrenia Network for Intermediate Phenotypes (B-SNIP) recently developed biomarker-based psychosis constructs—Biotypes—that capture biologically-distinctive groups of psychosis cases3. Based on multistep multivariate analyses using cognition, EEG, and oculomotor measures, three distinctive Biotypes emerged: Biotype1 (B1), characterized by poor cognitive and low sensorimotor function; Biotype2 (B2), with moderately impaired cognition and exaggerated sensorimotor reactivity; and Biotype3 (B3), with near normal cognitive and sensorimotor functions3. Notably, the conventional diagnoses [schizophrenia (SZ), schizoaffective disorder (SAD), psychotic bipolar I disorder (BD)] mapped poorly onto the Biotypes, with all three diagnoses distributed across all Biotypes. Analysis of whole brain voxel-wise morphometry—an “external validator” not used in Biotype development—demonstrated a step-wise pattern of gray matter density (GMD) reductions across the Biotypes: in B1, extensive and diffusely distributed GMD loss, with the largest effects in frontal, anterior/middle cingulate, and temporal regions; in B2, intermediate in magnitude and more localized reductions, with the largest effects in insula and fronto-temporal regions; and in B3, modest GMD reductions primarily localized to anterior limbic regions5. In the same sample segregated by conventional diagnoses, we observed highly similar (and diffusely distributed) GMD reductions in SZ and SAD, and modest, primarily frontal reductions in BD. Biotypes showed better between-group discrimination based on GMD features and were a stronger predictor of GMD alterations than the diagnoses. Additionally, distinct patterns of resting state connectivity6 emerged across the Biotypes.

There has been a growing interest in using machine learning approaches to disentangle the heterogeneity of mental illness. A critical question is whether there are biological measures—especially those commonly used in clinical practice, such as structural MRI—that can accurately predict different psychosis groups according to conventional diagnoses or, of importance here, to a Biotypes classification scheme. Prior research has primarily focused on using structural MRI measures (e.g., GMD/volume, cortical thickness) to train classification algorithms in order to discriminate psychosis groups (mainly SZ and BD) from healthy controls (CON). Previous reports have demonstrated that gray matter-focused features can discriminate SZ vs. CON with high levels of accuracy (0.66–0.89)7,8,9,10,11,12, including in multi-site datasets13,14 (for related findings from “fusion” approaches, see Refs.15,16,17,18,19). Studies in BD have found somewhat lower classification accuracies for BD vs. CON (0.61–0.78)20,21,22 relative to those for SZ. Few studies have attempted to discriminate SZ vs. BD in a single analysis, and those that did yielded modest classification accuracies (e.g., 0.66 in Ref.23). The limited ability to differentiate between SZ and BD is likely due to the significant biological heterogeneity of the disorders, as previously demonstrated in our own3,5 and others’24,25,26,27,28,29,30 work. Recently, Mothi et al.31 investigated the utility of unsupervised machine learning for delineating psychosis subgroups in the B-SNIP sample. Integrating symptom-based ratings and biomarker data, they identified three distinct subgroups (called “G1, G2 and G3”, different from the Biotypes). Subsequent analysis of external validators showed that the subgroups differed significantly in cortical thickness, oculomotor and general and social functioning measures, with G1 showing the greatest, and G3 the least, impairment. Similar to the B-SNIP Biotypes, the psychosis subgroups identified in this study showed only limited correspondence with conventional diagnoses31.

Building on our prior work demonstrating distinct VBM-based GMD alterations among the B-SNIP Biotypes3,5, we used a supervised machine learning classification approach to determine whether GMD characteristics can reliably discriminate between psychosis groups—categorized either according to Biotype or conventional diagnosis—and healthy individuals. We hypothesized that a GMD-based classifier would show more specificity for biologically-defined Biotypes relative to symptom-based diagnoses. In addition, we explored whether GMD-based classifier output for Biotypes is associated with clinical and other biomarker measures. This exploratory analysis aimed to examine if the GMD-based classifier captures dimensions of psychosis that fall along a continuum that is present in both psychosis cases and CON (i.e., independent of group membership).

Methods

Study sample

Voxel-wise GMD metrics from the Voxel-Based Morphometry pipeline32,33 were extracted for 808 subjects [557 psychosis, 251 CON], initially categorized according to Biotype and then by DSM diagnosis (for demographic and clinical data, see Table 1 and Supplemental Table S1).

Table 1 Socio-demographic and clinical characteristics of the study sample by Biotype and Conventional Diagnosis.

The B-SNIP study’s logistics and overall sample characteristics are described elsewhere34. Psychosis subjects were stable, medicated outpatients. CON subjects had no personal history of psychotic or recurrent mood disorders and no family history of schizophrenia/bipolar spectrum disorders in first- or second-degree relatives. Psychiatric diagnoses (and absence thereof in CON) were established via formal diagnostic consensus conferences including a review of the Structured Clinical Interview for DSM-IV-TR Diagnosis (SCID-I/P)35 and all available clinical information. The study was approved by Institutional Review Boards at all B-SNIP data collection sites: (1) University of Texas Southwestern Medical Center, (2) Olin Neuropsychiatry Research Institute, Hartford Hospital, Yale School of Medicine, (3) Maryland Psychiatric Research Center, University of Maryland School of Medicine, (4) University of Illinois at Chicago, (5) Wayne State University, School of Medicine, (6) Harvard University Medical School. All subjects provided written informed consent after the study procedures had been fully explained, and all study procedures were performed in accordance with relevant guidelines and regulations.

Gray matter density parameters extraction for machine learning analyses

T1-weighted structural images were acquired on 3 T MRI scanners at 5 B-SNIP sites. Magnetization Prepared Rapid Gradient Echo (MPRAGE) or Inversion Recovery-Prepared Spoiled Gradient-Echo (IR-SPGR) sequences, as appropriate for the scanner brands, were used; image parameters were consistent with the Alzheimer's Disease Neuroimaging Initiative (ADNI1) protocol (http://adni.loni.usc.edu/methods/documents/mri-protocols/). Images were preprocessed and analyzed using the optimized Voxel-Based Morphometry32 toolbox (VBM8) for Statistical Parametric Mapping (SPM8) (http://www.fil.ion.ucl.ac.uk/spm/software/spm8). The analysis pipeline incorporated the Diffeomorphic Anatomical Registration Through Exponentiated Lie Algebra (DARTEL), a high-dimensional nonlinear inter-subject registration tool33,36. The MRI parameters, quality control procedures, and pipeline are detailed in Supplemental Methods.

GMDs were extracted from the segmented and modulated gray matter images (smoothed at 8 mm FWHM) within a gray matter mask. We adopted procedures similar to prior work (e.g. Ref.37) to restrict the classification analyses to grey matter voxels. The mask included voxels that met two criteria: values > 0.40 and < 0.60 in the grey and white matter tissue probability maps, respectively. These thresholds were selected based on visual inspection by the first author to avoid including an excessive number of white matter voxels in the mask and to minimize partial volume effects. Although visual inspection is a common and generally desirable step in imaging analysis, as it allows gross artifacts and other image irregularities to be “screened out”, it can be limited because the decision processes can be difficult to replicate. We have included the specific mask in the Open Science Framework repository for this project to support reproducibility of the analyses reported here (https://osf.io/9ra6j/?view_only=02b0bd7639c64bddbb6ecc4903c1e5d7). No further feature reduction or selection steps were included in the analysis pipeline. Thus, all 371,243 features were used in the machine learning analyses.
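As a minimal, illustrative sketch (not the study code), the mask logic described above can be expressed as follows; the array shape and values are random placeholders standing in for the real tissue probability maps.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (121, 145, 121)                     # illustrative MNI-space grid, not the study's exact dimensions
gm_prob = rng.uniform(0, 1, size=shape)     # placeholder grey matter tissue probability map
wm_prob = rng.uniform(0, 1, size=shape)     # placeholder white matter tissue probability map

# Retain voxels with grey matter probability > 0.40 and white matter probability < 0.60
mask = (gm_prob > 0.40) & (wm_prob < 0.60)
print("voxels retained:", int(mask.sum()))  # 371,243 in-mask voxels in the analyses reported here

# With real data, each subject's smoothed, modulated GM image (loaded, e.g., via nibabel)
# would be vectorized within this mask to form one row of the subjects-by-voxels feature matrix:
# features = gm_image_data[mask]
```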

Clinical and biomarker measures associated with machine learning classifier output

We further explored associations between GMD-based classifier performance (specifically, the B1 classification model, the only model that demonstrated ‘specificity’, see “Model training and testing” and “Gray matter density-based classifier performance across the biotypes”) and several clinical and biomarker measures. The variables of interest for these analyses were chosen based on two criteria. First, these data were available for both psychosis and CON subjects, which was necessary for testing “brain-behavior” associations dimensionally (i.e. across both psychosis and CON individuals regardless of their group membership). Second, the selected clinical and biomarker variables were not used in the original Biotype development3. This allowed us to avoid potential “circularity” in the association analyses. In total, six clinical and biomarker measures that satisfied both criteria were selected for the association analyses: an estimate of premorbid general intellectual ability [the Wide Range Achievement Test-4, Word Reading subtest (WRAT-4)], oculomotor function [the Smooth Pursuit Eye Movement (SPEM) task], EEG [intrinsic EEG activity (IEA) derived from inter-stimulus intervals during an auditory paired stimuli task38], and general and social functioning [DSM-IV Axis V: Global Assessment of Functioning Scale (GAF), Birchwood Social Functioning Scale (SFS)34]. The SPEM and IEA data were reduced via a principal component analysis (PCA), yielding two PCA components for SPEM and one PCA component for IEA. Proportions of subjects with missing data across the 6 variables of interest, comparable across the Biotype and CON groups, are shown in Supplemental Table S2.
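For illustration, the PCA reduction of the SPEM and IEA measures could be carried out along the following lines. This is a hedged sketch with random placeholder data; whether the measures were standardized before PCA is an assumption, and the variable names and column counts are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
spem = rng.normal(size=(808, 5))   # placeholder SPEM measures (number of columns is illustrative)
iea = rng.normal(size=(808, 4))    # placeholder intrinsic EEG activity (IEA) measures

# Two retained components for SPEM, one for IEA, as described above
spem_pcs = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(spem))
iea_pc = PCA(n_components=1).fit_transform(StandardScaler().fit_transform(iea))
```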

Machine learning analyses

Model training and testing

Machine learning analyses examined whether patterns of GMDs can reliably classify psychosis cases organized by Biotype or diagnosis. We used a repeated train/test split approach with 1000 iterations. For each iteration, a randomly selected subset of the data was used to train the classification model, and the held-out data were used to test the performance of the classification model. All classification models were based on L2-regularized logistic regression (penalty = 1) using the liblinear package39 implemented in the Princeton MVPA toolbox (https://github.com/princetonuniversity/princeton-mvpa-toolbox).
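The classifier itself can be sketched in scikit-learn, which wraps the same liblinear solver. This is an approximate re-expression rather than the MVPA-toolbox code used in the study; mapping “penalty = 1” onto C = 1.0 is an assumption, and the data are random placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(176, 1000))   # placeholder subjects-by-voxels GMD matrix
y = rng.integers(0, 2, size=176)   # placeholder binary labels (psychosis group vs. CON)

# L2-regularized logistic regression fit with the liblinear solver
clf = LogisticRegression(penalty="l2", C=1.0, solver="liblinear").fit(X, y)
evidence = clf.predict_proba(X)[:, 1]   # per-subject classifier evidence for the psychosis class
```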

The following methods apply to each iteration of the repeated train/test split approach. A random sample of 88 cases from each Biotype (B1, B2, B3) and from CON was used for model training. This training set size was selected to ensure a minimum of 50 test cases in each group. The DSM categorization was not considered for case selection for the Biotype groups. The remaining cases were used as the test set (B1, n = 62; B2, n = 97; B3, n = 134; CON, n = 163). Three binary classification models were trained, each discriminating one of the three Biotype groups vs. CON. The same n = 88 CON training sample was used for the three classification models in each iteration. The trained models were then applied to (i.e., tested on) every case in the held-out test groups. Classifier accuracy was computed using a balanced accuracy metric (i.e., the unweighted average of each group’s classification accuracy, or the average of the sensitivity and specificity of the classifier), given the unequal number of cases between the two classes in the test data. A model’s classification accuracy was determined to be significant if the 99.17% confidence interval (CI) for overall classifier accuracy (i.e., aggregate classification accuracy across both groups in the model) across the 1,000 repeated train/test iterations did not encompass the nominal chance value of 50%. We used a 99.17% CI as a conservative approach to control for multiple comparisons (three) in the overall classification accuracies within each approach, i.e., Biotype or diagnosis.
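A brief sketch of the accuracy summary, assuming per-iteration sensitivity and specificity values are already in hand; the percentile construction of the 99.17% interval is an assumption made for illustration, and the values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
sensitivity = rng.uniform(0.55, 0.80, size=1000)   # placeholder per-iteration accuracy for the psychosis group
specificity = rng.uniform(0.55, 0.80, size=1000)   # placeholder per-iteration accuracy for CON

balanced_acc = (sensitivity + specificity) / 2     # unweighted average of the two per-group accuracies

# Central 99.17% interval across iterations; (100 - 99.17) / 2 = 0.415 in each tail
lo, hi = np.percentile(balanced_acc, [0.415, 99.585])
print(f"mean = {balanced_acc.mean():.3f}, 99.17% interval = [{lo:.3f}, {hi:.3f}], above chance: {lo > 0.50}")
```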

Importantly, a model for each Biotype (e.g., B1) was also tested on the other two Biotype groups (e.g., B2 and B3), allowing assessment of the ‘specificity’ of each model. The idea behind this approach is that if a given model (e.g., B1) classifies the designated Biotype group above nominal chance but fails to do so for the other two Biotypes, then the model is likely identifying GMD features that are specific to a particular Biotype group. If, however, a model classifies, or labels, the other Biotype groups at rates exceeding nominal chance, then the model is likely identifying non-specific GMD features indicative of psychosis as a whole.
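The specificity check can be illustrated as follows, again with random placeholder data; group sizes follow the test-set sizes given above, and the feature count is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(176, 1000))   # placeholder GMD features: 88 B1 cases + 88 CON cases
y_train = np.repeat([1, 0], 88)          # 1 = B1, 0 = CON
X_b2 = rng.normal(size=(97, 1000))       # placeholder features for the held-out B2 test cases

clf = LogisticRegression(penalty="l2", C=1.0, solver="liblinear").fit(X_train, y_train)
b2_labeled_b1 = (clf.predict(X_b2) == 1).mean()   # proportion of B2 cases labeled as "B1"
# A rate reliably above chance across iterations would indicate non-specific (shared) GMD features.
```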

The above methods were also employed to examine classification of the three diagnostic groups [SZ, SAD, BD]. Case selection for each iteration was not stratified with respect to the Biotype membership. The training set size was identical to that described above, and the remaining cases [SZ, n = 154; SAD, n = 50; BD, n = 89; CON, n = 163] were held out from model training to allow a test of classifier accuracy. Each diagnosis model (e.g., SZ vs. CON) was applied to all cases in the test set for the other diagnostic groups (e.g., SAD and BD) to test model specificity.

The classification accuracy data for all models were analyzed in R40, and the raincloud figures used to visualize the data were created using ggplot241. The feature importance weights for each voxel in each model were used to create feature importance maps, following a procedure similar to that used for the classification accuracy measures. The procedures specific to this step are described in the Supplemental Methods, and a brief description of these data is provided in “Feature weights for the classification models” below.
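As a hedged illustration of how per-voxel feature weights can be assembled into a map (the exact procedure is given in the Supplemental Methods), one could average the model coefficients across iterations and project them back into the grey matter mask; all arrays below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
mask = rng.uniform(0, 1, size=(121, 145, 121)) > 0.7   # placeholder grey matter mask
n_voxels = int(mask.sum())

# One coefficient vector per train/test iteration (e.g., clf.coef_.ravel() from each fitted model)
coefs = rng.normal(size=(1000, n_voxels))              # placeholder coefficient values

weight_map = np.zeros(mask.shape)
weight_map[mask] = coefs.mean(axis=0)                  # average feature weight per voxel
# With real data, this volume would be written back to NIfTI (e.g., via nibabel) for visualization.
```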

Associations between brain structure-based classifiers and other biomarker and clinical measures

We carried out exploratory analyses examining the relationship between GMD-based classifier output and additional biomarker and clinical measures (described in “Clinical and biomarker measures associated with machine learning classifier output”). These exploratory analyses were restricted to classifier output from the B1 classification model because it was the only model that demonstrated ‘specificity’ (see “Model training and testing” and “Gray matter density-based classifier performance across the biotypes”). We conducted a series of regression analyses that predicted clinical and biomarker measures from classifier output (i.e., prediction probability) of the B1 model. This analysis was conducted on the entire test sample from each iteration of the repeated train/test splits of the data. Using cases from all four groups allowed us to test for an interaction between GMD-classifier evidence and group membership in predicting the clinical and biomarker measures. The rationale for this analysis is rooted in the idea that the GMD-based classifier model may provide a sensitive measure that can reliably predict an individual’s clinical and/or neurobiological profile regardless of group membership. Finding a group-invariant relationship would suggest that this GMD-based classifier approach might capture important biomarker and clinical characteristics that span a “disease/CON” dimension, and that are not characteristic of only a single psychosis group.

Each multiple regression model included the biomarker or clinical measure as the outcome variable and a total of 7 predictor variables: classifier output from the B1 model, three dummy-coded group variables (one for each Biotype group; CONs served as the reference group), and three variables representing the interaction between each dummy-coded group variable and the B1 model evidence. Classifier evidence from the B1 model takes on values between 0 and 1, with higher values indicating increased classifier evidence that a data point showed a pattern of GMD characteristic of B1. B1 classifier evidence was mean centered before creating the interaction terms to reduce multi-collinearity between the predictor variables. The interaction terms allowed us to test if group membership moderates the association between GMD-classifier output and the examined outcome variables. Given the exploratory nature of these analyses, the model term for B1 classifier evidence (and all other regression model terms) was deemed significant if the 95% interval of unstandardized b values obtained across the 1,000 iterations of the repeated train/test splits did not include 0.
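A minimal sketch of one such regression, using a formula interface with CON as the reference level; the column names and simulated data are illustrative, and this is not the analysis code used in the study.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 456   # size of one held-out test set (62 + 97 + 134 + 163)
df = pd.DataFrame({
    "group": rng.choice(["CON", "B1", "B2", "B3"], size=n),
    "b1_evidence": rng.uniform(0, 1, size=n),   # classifier output (prediction probability) from the B1 model
    "wrat4": rng.normal(100, 15, size=n),       # placeholder outcome measure
})
df["b1_evidence_c"] = df["b1_evidence"] - df["b1_evidence"].mean()   # mean-center before forming interactions

# Main effects of classifier evidence and dummy-coded group, plus group x evidence interactions
fit = smf.ols("wrat4 ~ b1_evidence_c * C(group, Treatment(reference='CON'))", data=df).fit()
print(fit.params)   # the b1_evidence_c coefficient would be collected across the 1000 iterations
```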

Results

Gray matter density-based classifier performance across the biotypes

The results for the three Biotypes classification models are shown in Fig. 1 and Supplemental Table S3. For the model comparing B1 and CON, overall model classification accuracy was significantly above chance. Classification accuracies were also significantly above chance for both the B1 and CON groups. Importantly, the model did not classify either B2 or B3 cases as belonging to the B1 group at rates above chance. We interpret this pattern of results as evidence for specificity in discriminating between B1 and CON based on GMD features.

Figure 1

Raincloud plots from the analysis of the GMD classifiers for the (a) Biotype group 1 (B1) versus Control (CON), (b) Biotype group 2 (B2) versus CON, and (c) Biotype group 3 (B3) versus CON. In each panel, the dots represent balanced classifier accuracy for each of the 1000 iterations, the density plot shows the distribution of accuracy values across iterations, and the black dot and line reflect the mean and 99.17% interval of the accuracy values, respectively. Overall accuracy reflects the combined accuracy for the two groups in the model (e.g., overall accuracy for the B1 model is the combined accuracy of B1 and CON cases). The other columns of the figure reflect accuracy for individual groups. Note that for groups not included in the training model (e.g., B2 and B3 for the B1 model), the ‘accuracy’ value reflects the rate at which the classifier labeled those cases as belonging to the psychosis group (B1).

For the model comparing B2 and CON, the overall model classification accuracy was significantly above chance. Classification accuracies were also significantly above chance for both B2 and CON. However, this model did not demonstrate specificity: although B3 cases were not misclassified as B2 at above chance levels, B1 cases were misclassified as B2 significantly above chance. Thus, it appears that classification accuracy of the B2 model was driven by GMD features common to both B1 and B2, relative to CON, and not features specific to B2.

Lastly, overall classification accuracy for the B3 vs. CON model was significantly above chance. However, neither the separate classification accuracies for B3 nor for CON exceeded chance levels. Moreover, the model did not show specificity as B1 cases were misclassified as belonging to the B3 group at rates greater than nominal chance. B2 cases were not misclassified as B3 above nominal chance. This pattern of results suggests that classification performance of the B3 model was driven by GMD characteristics common to both the Biotypes and CON.

Gray matter density-based classifier performance across conventional diagnoses

The results for the three conventional diagnoses classification models are shown in Fig. 2 and Supplemental Table S4. Overall classification accuracy for the SZ vs. CON model, as well as accuracies for both SZ and CON, were significantly above chance. However, the model misclassified SAD cases as belonging to SZ at above chance levels (at a similar rate to SZ cases). BD cases were not misclassified as SZ above nominal chance. Therefore, the SZ vs. CON model appeared to be nonspecific and driven by GMD features common to both SZ and SAD.

Figure 2

Raincloud plots from the analysis of the GMD classifiers for (a) Schizophrenia (SZ) versus Control (CON), (b) Schizoaffective disorder (SAD) versus CON, and (c) Bipolar disorder (BD) versus CON. In each panel, the dots represent balanced classifier accuracy for each of the 1000 iterations, the density plot shows the distribution of accuracy values across iterations, and the black dot and line reflect the mean and 99.17% interval of the accuracy values, respectively. Overall accuracy reflects the combined accuracy for the two groups in the model (e.g., overall accuracy for the SZ model is the combined accuracy of SZ and CON cases). The other columns of the figure reflect accuracy for individual groups. Note that for groups not included in the training model (e.g., SAD and BD for the SZ model), the ‘accuracy’ value reflects the rate at which the classifier labeled those cases as belonging to the psychosis group (e.g., SZ).

Overall classification accuracy for the SAD vs. CON model was significantly above chance. The CON group, but not the SAD group, was classified at above chance rates. The model misclassified SZ cases as belonging to SAD at above chance levels, while BD cases were not misclassified as SAD. Thus, similar to the SZ vs. CON model, the SAD vs. CON model appeared to be nonspecific and driven by brain structural characteristics common to SZ and SAD.

Lastly, overall classification accuracy for the BD vs. CON model was significantly above chance. However, neither the classification accuracy for BD nor that for CON was above nominal chance. The model also misclassified SZ, but not SAD, cases as belonging to the BD group at an above chance rate. Thus, classification performance of the BD vs. CON model appeared to be driven by GMD features common to BD, SZ, and CON.

Feature weights for the classification models

The voxel-wise feature maps across the Biotype and conventional diagnosis groups (Supplementary Figs. S1 and S2) paralleled the spatial voxel-wise GMD maps from the original VBM analyses5. A stepwise distribution of classifier feature weights was observed across the Biotypes, with the most diffusely distributed, consistent, and strong features in B1, considerably fewer consistently strong features in B3, and an intermediate number of features in B2 (Supplemental Fig. S1). In contrast, across the conventional diagnoses (Supplemental Fig. S2), the consistent and strong classifier features were more spatially similar to each other, echoing our VBM findings of poor separation of the SZ, SAD and BD groups based on GMD. It is important to note that it is difficult to make claims as to whether the identified features are significant in a statistical sense, or whether they are the most important. The contribution of a feature in any machine learning model depends, in part, on the other features that are included in the model. Thus, these feature maps and results should not be taken as evidence that these features in isolation would produce the same results reported above in “Gray matter density-based classifier performance across the biotypes” and “Gray matter density-based classifier performance across conventional diagnoses”.

Associations between gray matter density-based classifier performance for biotype 1 vs. controls and biomarker and clinical measures

We next explored whether classifier evidence indicating membership in the B1 group (derived from the B1 model, which was the only model to show specificity) was associated with biomarker and clinical measures that were not used in the Biotype development, and whether any such associations held regardless of group membership. Table 2 shows the results for all six measures. B1 classifier evidence demonstrated a significant negative association with a single measure, i.e., an estimate of premorbid general intellectual ability (WRAT-4/Reading Subtest). Importantly, this relationship appeared to be group invariant: there was no evidence of a significant interaction between any group and B1 classifier evidence for WRAT-4 scores. No other significant associations were found.

Table 2 Summary of regression model results. Measures reflect the average beta coefficient across the 1000 iterations with the 95% confidence bounds in parentheses (Lower, Upper).

Discussion

Expanding on our prior work—the development of distinct B-SNIP psychosis Biotypes derived from cognitive and neurophysiologic biomarkers3 and the characterization of brain structural and functional alterations across Biotypes using independent neuroimaging measures5,6—here, we examined whether a supervised machine learning approach applied to voxel-wise GMD measures would successfully classify the Biotype and conventional diagnosis groups. Compared to symptom-based diagnoses, Biotypes appear to capture neurobiologically-distinctive and more homogeneous psychosis subgroups3. Therefore, we predicted that a GMD-based classifier would demonstrate more specificity for Biotypes than diagnoses. We also examined whether GMD-based classifier evidence was associated with several biomarker and clinical measures not used in Biotype definition.

Our results converge with and extend previous work using machine learning approaches to classify psychotic disorders using features derived from structural MRI7,8,9,10,11,12,13,14,20,21,22. First, we replicate previous findings indicating that patterns of GMD discriminate SZ and BD from CON at above chance rates. We also extend prior work by demonstrating that GMD can discriminate SAD [which is typically either not included in such analyses (e.g. Refs.7,10,15) or is merged with SZ cases (e.g. Ref.11)] from CON.

A novel extension of our study is the application of machine learning approaches to the discrimination of Biotypes—experimental neurobiologically-based categories of psychosis3. Notably, the classifier models were trained on whole-brain GMD features that were not used to derive Biotypes. All three Biotype models classified cases at above chance rates, indicating that patterns of GMD can discriminate between biologically-derived subgroups of psychosis (relative to CON). An interesting aspect of these findings is that classification accuracy demonstrated a gradient, such that it was numerically highest for the B1 model, intermediate for the B2 model, and lowest for the B3 model. This echoes both the overall pattern of GMD reductions relative to CON5 and the differing levels of cognitive impairment3 previously demonstrated across the three Biotypes.

One aim of the present study was to determine if machine learning models can capture features that are specific to categories of psychotic disorders. The inclusion of multiple psychosis subgroups allowed us to assess specificity for both the Biotype- and conventional diagnosis-based classification schemes. We applied each of the models trained to classify one of the groups (e.g. B1 vs. CON) to the other two groups within the same classification scheme (e.g. B2 and B3), which allowed us to determine the rate at which the two groups not included in the model (e.g. B2 and B3) were classified as members of the psychosis group used to train the model (e.g. B1). The logic of this approach is that a model capturing GMD features specific to a particular psychosis subgroup would not lead to above chance “misclassification” of cases as belonging to a different subgroup. Only the B1 model showed evidence of model specificity. That is, the B1 model classified only members of the B1 group, and not the B2 or B3 groups, as belonging to the B1 group at rates above nominal chance. Neither the B2 nor the B3 model, nor any of the conventional diagnosis models (see also Ref.42), showed similar evidence of specificity. The lack of model specificity might explain the consistently modest classification rates observed when attempting to differentiate SZ and BD in prior research23.

We also explored if classifier performance predicted individual differences in biomarker and clinical measures that were not used in Biotype creation. This analysis was restricted to classifier evidence from B1 vs. CON because it was the only model that demonstrated evidence of specificity. We aimed to explore whether meaningful relationships exist between the brain structure-based classifier evidence and a series of clinical and biomarker measures that could elucidate neurobiology/behavior interactions specific to individuals expressing B1-like patterns of GMD. Importantly, we chose analytic strategies that would allow capture of dimensional aspects of such relationships, independent of group membership or, indeed, of the distinction between psychosis cases and CON. We found that, among the six tested measures spanning cognition, EEG, oculomotor, and general and social functioning, a single measure—an estimate of premorbid general intellectual ability (WRAT-4/Reading Subtest)—demonstrated a negative association with GMD-based B1 classifier evidence. That is, a higher probability of being classified as B1, based on GMD characteristics, was associated with a lower estimate of general intellectual ability. Importantly, we did not detect a significant interaction between any psychosis group and B1 classifier evidence for this measure, indicating that the relationship was group-invariant.

Single-word reading as assessed by the WRAT-4/Reading Subtest43 provides a widely used estimate of premorbid intellectual ability (in disease samples)44. It is considered to capture crystallized intellectual ability rather than fluid/dynamic intellectual functions. While fluid cognitive function [as captured by the Brief Assessment of Cognition in Schizophrenia (BACS)] was used to discriminate the Biotypes3, and there is a relation between measures of fluid and crystallized intellectual ability, the correlations between BACS total scores and WRAT-4/Reading Subtest scores in our sample were moderate with only 10–15% shared variance (B1, r = 0.33, R2 = 0.11; B2, r = 0.31, R2 = 0.10; B3, r = 0.39, R2 = 0.15; all p < 0.05). Thus, measures of these two aspects of cognition each provide important and largely non-overlapping information in psychosis samples. The difference between aspects of cognition captured by WRAT and BACS has been previously used to assess psychosis-related cognitive decline in the B-SNIP sample45. Other reports from our group examined relationships between premorbid intellectual ability/WRAT-4 Reading measure and an array of biomarkers, including structural brain metrics46 and polygenic risk for schizophrenia47.

The association observed here between premorbid intellectual function and GMD-based probability of being classified as psychosis B1—regardless of group membership (psychosis or CON)—highlights the importance of the relationship between premorbid cognitive development and brain structure. Notably, WRAT-4 was the only measure that showed a significant association with the GMD-based Biotype classification. One important avenue for future research is to examine cognition/brain structure relationships over the course of development with the aim of identifying ‘high risk’ subgroups who may merit clinical monitoring. Another possibility is to explore cognition/brain structure interactions in prodromal and early psychosis samples, to test whether these features predict psychosis progression and broader functional outcomes. Further detailed investigation of relationships between cognition, brain structure and other biomarkers within and across psychosis Biotypes, and the replication of these relationships in independent samples, are essential to validate and extend our findings.

There are some limitations of the study that warrant mention. First, our classifiers were trained only with features derived from a single imaging modality. Anecdotally, previous research using features derived from multiple modalities to classify conventional diagnostic categories led to higher classification accuracy relative to studies using only a single imaging modality16,17,18,19. Using features from a single imaging modality might have underestimated classification accuracy and positively (or negatively) affected the results related to model specificity. Second, we did not directly compare the GMD-based classification outcomes between the Biotype vs. diagnosis categorizations. This requires further work and development in optimizing machine learning approaches to directly compare different classification schemes. Third, our findings require validation in independent samples suitable for “biotyping” based on a broad set of biomarker measures.

Together, our findings indicate that brain-based biomarker classification schemes, such as Biotypes, may hold promise in capturing disease features that are more specific to underlying psychosis neurobiology than are phenomenologically-defined diagnostic categories of psychosis. It is important to note that our results do not support the notion that the Biotype GMD-based classification scheme is superior to (i.e., yields higher accuracy than) the classifier performance achievable for conventional diagnoses. Our primary conclusion is that Biotypes appear to be associated with more specificity in brain structure-based classification. This conclusion is based on the B1 model correctly classifying B1, but not B2 and B3, cases at above chance rates. In contrast, none of the diagnosis classifier models demonstrated such specificity based on brain structural features. Future research is needed to further investigate the accuracy and model specificity associated with classification of psychosis based on neurobiological features.