Introduction

Despite decades of research, there are no precise and reliable etiopathophysiological markers for major psychiatric conditions.1 Impeding factors range from inherent challenges in studying complex genetic disorders2 to weakly established neural bases for cognition, experience and behaviour.3,4 However, part of the problem is a mismatch between current diagnostic standards for psychiatric illnesses and observations emerging from basic systems and behavioural neuroscience research.5 Recognized biological heterogeneity also adds to the difficulty of identifying reliable biological markers associated with these conditions.6 Treatments for psychiatric disorders have emerged largely from serendipitous observations7 and carry an unfortunate range of side-effects,8 which may be why mortality and prevalence rates associated with psychiatric illnesses have not decreased in recent years,9 in contrast to other medical conditions such as certain types of cancer10 or heart disease.11

In particular, the underlying pathophysiology of schizophrenia, a severe and debilitating psychotic illness, remains elusive, with few established consistent findings.12 Objectively measurable diagnostic tests for schizophrenia are currently lacking,13 and the reliability of diagnoses based on observable signs and symptoms leaves room for improvement.5 Further, there is marked heterogeneity within the clinical manifestations of ‘schizophrenia’ as well as considerable overlap with other psychiatric diagnoses, leading many to question the validity of a singular disease entity.14

In this context, applying machine learning techniques to MRI data has the potential to provide an objective and evidence-based approach for the identification and management of schizophrenia.15,16 Machine-learned MRI models could, in principle, identify biological markers and delineate symptom clusters. Recently, an increasing number of studies have attempted to classify schizophrenia (vs. healthy controls) based on functional alterations in resting-state brain patterns (Table 1; see supplementary materials for a fuller description of these studies).

Table 1 List of single-site studies that provided machine learning models for predicting schizophrenia using resting-state brain patterns

Most earlier studies assessed patients already undergoing treatment, which means their fMRI scans were confounded with antipsychotic drug effects17 – hence, those scans did not correspond to the point of first medical contact, and models learned from them may not be optimal for diagnosis. Further, diagnostic models obtained from larger datasets (more than 100 subjects) have classification accuracies well below 80% (Fig. 1). Many have observed this phenomenon: “smaller-N studies reach higher prediction accuracy of schizophrenia with neuroimaging data”.18 Even with higher cross-validated accuracy, the smaller samples likely do not capture the heterogeneity of the disease, which suggests that these models will not generalize well to unseen cases.

Fig. 1 Negative correlation between sample size and cross-validated (CV) accuracy for predicting schizophrenia using resting-state brain patterns (studies cited in Table 1)

Many of these studies first parcellate whole-brain resting-state information into spatial regions that are considered homogeneous. However, with the increasing number of parcellation methods and atlases now available, the choice of which parcellation to use seems rather arbitrary. These methods vary widely in their underlying principles and can be based on (a) a pre-defined ontology of brain structures, such as post-mortem cytoarchitecture,19,20 sulco-gyral anatomy,21,22 or anatomical connectivity from diffusion imaging,23,24 or (b) data-driven modelling of the functional features in the BOLD signal from resting-state25 or task-based fMRI26,27 or even meta-analyses,28,29 using analytical techniques such as hierarchical clustering30 or independent components analysis.31 The quality of the resulting brain network, and of the downstream predictive model, may be largely influenced by the choice of atlas or parcellation.32,33 Brain segmentations based on these parcellation schemes not only reduce the dimensionality of fMRI data but also provide an elegant way to incorporate prior neurobiological knowledge to ‘refine’ the features. However, to date, there has been no investigation of whether combined learning from multiple predefined parcellation schemes can provide better performance for diagnostic prediction of schizophrenia.

In this study, we eliminated the potential confound of antipsychotic treatment by using resting-state fMRI data collected from a cohort of antipsychotic-naive schizophrenia patients (N = 81) as well as age- and gender-matched healthy controls (N = 93). The aim of our study was to improve diagnostic prediction accuracy, compared to results reported in the literature, by designing a feature creation and learning pipeline that incorporates prior knowledge of neuroanatomy and neurophysiology. Our overall model stacks predictions from several single-source models, each based on a specific set of features related to regional fMRI activity or functional connectivity and a specific a priori parcellation scheme. We demonstrate that our ensemble model yields a classification accuracy of 87% (vs. 53% chance), better than any standard single-source model considered in the study. To the best of our knowledge, (1) the performance of our model, based on 174 subjects, exceeds that of earlier machine learning models for diagnosing schizophrenia from resting-state fMRI measures that were learned from datasets of N > 100 subjects; and (2) this is the only such classification model that has been built and validated exclusively on never-treated schizophrenia cases.

Our method relies on a single modality of data acquisition for neuroimaging and is easily scalable as it uses a set of pre-defined atlases—i.e., it does not rely on data-driven brain parcellation methods, such as group-independent component analysis.

Results

We show below that (a) our EMPaSchiz ensemble learner, which combines learned classifiers, each trained on its own combination of neuroimaging feature type and brain parcellation scheme, produces a classifier that predicts schizophrenia more accurately than any of the individual predictors (each of which uses just a single feature/parcellation combination); (b) within this ensemble prediction framework, even a very small fraction of features (as low as the top 0.5%, selected via univariate tests) can still provide high prediction accuracy (>80%); and (c) this learning framework can also produce models that distinguish clinically symptomatic from non-symptomatic patients with moderate accuracy.

Table 2 presents the 5 × 10-fold cross-validation prediction performance of the various learners in EMPaSchiz. The majority-class baseline accuracy for schizophrenia prediction (declaring every subject to be a control) was 53.4% (93 controls of 174 total subjects). These accuracy values are plotted in Fig. 2. Stacked models with regional neuroimaging features—viz., ALFF, fALFF and ReHo—had accuracies in the range of 74 to 76%, while those based on functional connectivity—viz., FC-correlation, FC-partial correlation and FC-precision—showed better performance, with 79 to 84% accuracy. The final ensemble model, EMPaSchiz (stacked-multi), showed the best performance, with accuracy of 87%, sensitivity of 80%, specificity of 93% and precision of 92%, each with standard errors of 1–2%. The accuracy of stacked-multi was significantly better than that of the second-best stacked model (stacked-FC-precision at 84%; t-test, p = 0.03).

Table 2 Model performance (in percent) and elements of the confusion matrix for the various stacked learners in the EMPaSchiz model: average (standard error), 5 × 10-fold CV
Fig. 2 Comparison of 5 × 10-fold cross-validation prediction accuracies for the stacked learners in the EMPaSchiz model. The comprehensive ensemble of EMPaSchiz, “stacked-multi”, is shown in red. Stacked-multi shows the best performance and performs significantly better than all other stacked models (all p < 0.05). The dotted line is the majority-class baseline predictor. SEM, standard error of the mean

Figure 3 shows a comparative profile of accuracies for the various SSM predictors along with the EMPaSchiz stacked models. (Supplementary material provides results in tabular format as well as plots of comparisons limited to specific feature types; it also provides results for various ensemble learners that were stacked parcellation-wise.) Prediction accuracies for SSMs ranged from 52% (FC-precision with harvard_sub_25) to 83% (FC-precision with basc_multiscale_444), averaging 73% overall. In general, the basc_multiscale atlases performed better than the others. For instance, accuracies of EMPaSchiz stacked models were comparable to those of basc_multiscale_197 models for FC-correlation, at 82%, and for FC-partial correlation, at 79%.

Fig. 3 Comparison of 5 × 10-fold cross-validation prediction accuracies for single-source and multi-source models. The comprehensive ensemble model of EMPaSchiz, “stacked-multi”, is shown in red. (The horizontal dotted line, at 0.53, is the majority-class baseline predictor. SEM, standard error of the mean)

We examined the effect of feature selection using the top-r percentage of total features, based on a univariate test that retains the r% of features with the highest F-value scores, for r = 0.5%, 1%, 2%, 5%, 10%, 20%, 30%, as well as “all regional features + 30% of connectivity features” (we chose this combination because, for any given parcellation, the number of regional features was much smaller than the number of connectivity features) and all features (no feature selection). Note that each “setting” is applied to all 84 SSMs. Figure 4 shows the comparative profile of model performances with varying levels of r, along with the original EMPaSchiz (stacked-multi) model, where feature reduction was done using PCA. (Supplementary materials provide results in tabular format as well as additional plots comparing feature selection methods for SSM and MSM models.) Using all features (r = 100%, i.e., no selection/reduction) gave an accuracy of 85% (slightly poorer than the 87% with PCA-reduced features, though not a statistically significant difference), and accuracy declined only slightly as r was reduced gradually to as low as 0.5%. It is noteworthy that with only the top 0.5% of features, our ensemble prediction framework still showed a high prediction accuracy of 82%.
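The selection step described above can be sketched with scikit-learn's univariate routines; the code below is illustrative (array shapes and variable names are assumptions, not the study's actual data or code):

  # Top-r% univariate feature selection (ANOVA F-test), as described above.
  import numpy as np
  from sklearn.feature_selection import SelectPercentile, f_classif

  rng = np.random.default_rng(0)
  X_train = rng.standard_normal((157, 6670))  # e.g., flattened connectivity features
  y_train = rng.integers(0, 2, size=157)      # 0 = control, 1 = schizophrenia

  # Keep only the top r% of features, ranked by F-value score (here r = 0.5)
  selector = SelectPercentile(score_func=f_classif, percentile=0.5)
  X_reduced = selector.fit_transform(X_train, y_train)
  print(X_reduced.shape)                      # about 0.5% of the 6670 features remain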

Fig. 4 Comparison of 5 × 10-fold cross-validation prediction accuracies for stacked-multi models with various levels of feature selection and PCA

Patients with schizophrenia in our sample showed a range of psychopathological symptom severity, as measured using the clinical scales SANS for negative symptoms (integer values from 0 to 110) and SAPS for positive symptoms (integer values from 8 to 55). We used the first and last quartiles of these scales to categorize the 20 least and the 20 most severely symptomatic patients. We then used our ensemble prediction framework in a leave-one-out cross-validation setup to predict the high-symptomatic patients against the non/low-symptomatic ones (majority-class baseline accuracy of 50%). We used leave-one-out cross-validation (rather than 10-fold) because of the low number of subjects (N = 40) available for this analysis. Prediction accuracy for the stacked-multi model was 73.2% for SANS-based and 61.9% for SAPS-based categorization of schizophrenia psychopathology.

To identify some of the key pathological alterations in our schizophrenia sample, we estimated the reliability of each feature’s importance for diagnostic prediction, similar to the approach used by an earlier neuroimaging study34 – sorting features by their mean logistic regression weight divided by its standard error, computed for each feature in a particular learned SSM across the 50 folds of cross-validation. (This was performed with raw ROI data, without any PCA transformation.) Figure 5 (respectively, Fig. 6) highlights some of the top-most (>98th or 99th percentile) reliable features, using representative atlases, for regional resting-state measures (respectively, connectivity measures).35 However, given the complexity of our ensemble model (which, recall, is based on 84 SSMs), these depictions should be considered representative in nature and cannot be claimed as the ‘only’ important features in the model.
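This reliability score can be sketched in a few lines; the array below stands in for the 50 sets of fitted logistic regression weights from one SSM (illustrative values, not the study's results):

  # Rank features by mean coefficient divided by its standard error across folds.
  import numpy as np

  rng = np.random.default_rng(0)
  coefs = rng.standard_normal((50, 116))      # 50 CV folds x 116 ROI features

  mean_w = coefs.mean(axis=0)
  se_w = coefs.std(axis=0, ddof=1) / np.sqrt(coefs.shape[0])
  reliability = mean_w / se_w                 # signed, t-like score per feature

  top = np.argsort(np.abs(reliability))[::-1][:5]
  print(top, reliability[top])                # indices of the most reliable features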

Fig. 5 Key pathological alterations in schizophrenia suggested by the top-most reliable features—elevated (red) and suppressed (blue) changes in regional activity. Panels show regional features above the 98th percentile of reliability. a Higher ALFF in the right caudate and right superior temporal pole (aal). b Higher ALFF in the lateral aspect of the left superior temporal gyrus and the horizontal ramus of the right lateral sulcus, and lower ALFF in the left posterior-dorsal cingulate gyrus (destrieux). c Higher fALFF in the left putamen and right caudate, and lower fALFF in the right anterior cingulum (aal). d Higher ReHo in the left superior temporal pole and right inferior temporal gyrus, and lower ReHo in the left inferior parietal lobule and right superior temporal gyrus (basc_multiscale_197)

Fig. 6 Key pathological alterations in schizophrenia suggested by the top-most reliable features—network edges show elevated (red) and suppressed (blue) changes in functional connectivity. Panels show functional connectivity features above the 99th percentile of reliability, using the dosenbach and msdl atlases. a Decreased functional connectivity between regions: left ventral frontal cortex and left lateral cerebellum; left occipital cortex and left angular gyrus; left middle insula and right fusiform gyrus; and left post parietal cortex with three nodes, namely right frontal gyrus, left parietal cortex and left precentral gyrus. Increased interhemispheric functional connectivity between the left superior frontal gyrus and the right anterior insula. b Decreased functional connectivity between regions: striatum and posterior occipital lobe; right intraparietal sulcus and right frontal pole; ventral anterior cingulate cortex and medial default mode network; left temporo-parietal junction and right parietal cortex; and right superior temporal sulcus and Broca’s area. Increased functional connectivity between regions: right anterior insula and striatum; right insula and left auditory cortex; and left anterior intraparietal sulcus and posterior occipital lobe

The pattern of functional connectivity changes (Fig. 6) indicates robust hypo-connectivity between the frontoparietal network (e.g., the post parietal node) and the sensorimotor network (e.g., frontal, parietal and precentral gyrus nodes), with widespread hypo-connectivity in the language (e.g., Broca’s area), attention (e.g., frontal pole, parietal cortex) and default mode (e.g., angular and fusiform gyri) networks. On the other hand, the auditory network, as well as the anterior insula, which is implicated in high-level cognitive control, attentional processes and saliency,36 show hyper-connectivity. Similarly, the overall picture (Fig. 5) shows increased regional low-frequency activity in the superior temporal gyrus and basal ganglia structures (caudate, putamen), and reduced regional activity in the cingulum.

Discussion

This study aimed to build a machine-learned classifier for diagnosing schizophrenia that depends on a single neuroimaging modality of acquisition: resting-state fMRI. Resting-state fMRI is a popular imaging method and possibly better suited than task-based fMRI, since the latter depends on experimental parameters that require standardization. Further, resting-state fMRI is not limited by participants’ attention or cognitive ability to perform a task, and hence is applicable to patients with more pronounced disabilities.37

Several recent studies have built diagnostic models using data from patients receiving antipsychotic drug treatment (see Table 1). However, antipsychotics are known to affect brain activity and function,38,39 and a recent study cautions against interpreting brain changes observed in a medicated state, noting that they might not be related to the pure pathology of schizophrenia.17 We developed the model presented in this study on a sample of never-treated schizophrenia patients, to make our results directly applicable to realistic clinical scenarios of diagnosis at first clinical presentation. Further work will be necessary to examine how this may generalize to medicated patients; other confounds, such as multi-site batch effects, also remain to be examined.40 It is notable, however, that non-medicated patients are an important group for analysis and represent perhaps the most difficult sample to recruit. Our study thus provides a valuable sample for demonstrating the merit of our approach.

With respect to the diagnostic accuracy of schizophrenia, Schnack and others have observed that smaller-sample studies may reach high prediction accuracy at the cost of lower generalizability to external samples – an effect attributed to clinical heterogeneity, physiological variation, sampling noise and errors in diagnosis.18 In our outline of the recent literature on machine learning studies with resting-state fMRI (see the Introduction section), we also observed this relation (see Fig. 1). Nevertheless, our ensemble model outperforms earlier models built for diagnosing schizophrenia using resting-state fMRI measures, even though it was learned from a large sample. We believe this may be because our feature creation process incorporates rich prior neurobiological knowledge through the simultaneous use of regional and connectivity measures, jointly extracted over various biologically informed brain atlas schemes. We demonstrate that if we employ standard machine learning pipelines (called SSMs here) on this dataset of untreated patients, we obtain a level of performance (<80% accuracy) that is similar to the results reported widely in earlier studies with comparable sample sizes. Hence, these drug-naive cases are unlikely to be ‘easier’ to model than standard treated cases. Our results provide encouraging progress toward deploying automated or semi-automated diagnostic systems based on neuroimaging and predictive models in psychiatric clinics. However, the performance of our model is favoured by the fact that the entire sample in this study comes from a single site, meaning it does not need to deal with the challenges of cross-site generalizability and site-specific effects. Future clinical studies with larger cohorts, preferably from multiple clinical sites, would be necessary to justify clinical deployment.

Our EMPaSchiz model used brain parcellations based on prior knowledge of anatomy/cytoarchitecture or on statistical maps extracted from the correlation structure of fMRI data collected and analysed in earlier studies. Hence, these maps might not perfectly adapt to the signals in an individual subject’s images – which would not be an issue for data-driven parcellation or clustering techniques. Our study neither explored that option nor empirically compared model performance against features obtained with these two alternative methodologies. However, the use of pre-existing parcellations reduces the chance of overfitting and possibly increases the robustness of the resulting model. Note also that these a priori ROIs neatly incorporate biological knowledge into the feature creation process, which can help with the interpretation of results, and provide an effective way to reduce dimensionality. Our model may be readily scaled up with relatively little computation, as it does not need to build parcellation maps from incoming training images.

It is often challenging to provide a biological interpretation of complex machine learning models, as the goal of the learning process is to find a model that maximizes prediction performance, which may require (possibly non-linear) combinations of thousands of features. In this study, we produced an effective classifier by seeking the coefficients for the features that collectively optimize predictive accuracy. In general, such coefficients need not correspond to the inherent correlation of each individual feature with the outcome. This is especially true in our approach of using multiple parcellations of the brain, as this means the “features” will overlap to a large degree. This can be seen as a potential limitation for the interpretation of our model. We provide only a snapshot of some representative changes in patients’ brains, showing only the most reliable resting-state features; features that, alone, may be neither necessary nor sufficient to obtain the prediction performance of the reported ensemble model. However, several of these brain networks and regions have been consistently observed to be altered in schizophrenia.41,42,43

Functional connectivity aberrations observed in our study are consistent with the dysconnectivity hypothesis of schizophrenia.44 This theoretical framework describes schizophrenia as a dysconnection syndrome, linking aberrations at the level of the synapse with abnormalities in the long-range connectivity of several brain networks.45 A vital component of the dysconnectivity hypothesis is the proposed aberrant connectivity between the prefrontal cortex and other brain regions, which is posited to give rise to key symptoms such as delusions and hallucinations.46 A systematic review of fMRI studies on functional connectivity supports reduced brain region connectivity in subjects with schizophrenia, especially reductions involving the prefrontal cortex,47 in agreement with our observations. Our finding of concurrent hyper-connectivity among some regions is also consistent with earlier reports of increased functional connectivity in schizophrenia.48 Another core postulate of the dysconnectivity hypothesis is that modulation of synaptic efficacy, with resultant fronto-temporo-parietal aberrations, leads to hallucinations/delusions in schizophrenia.49 The hypothesized synaptic efficacy aberrations may be linked to NMDA receptor abnormalities.49 In this context, it is of interest that modulating temporoparietal-prefrontal circuitry through transcranial Direct Current Stimulation (possibly via NMDA-dependent mechanisms50) has been shown to ameliorate the severity of auditory hallucinations,51,52 possibly through “correction” of functional dysconnectivity.53 Further systematic application of machine learning techniques to the analysis of brain connectivity may thus be useful for developing prognostic markers for schizophrenia that could predict differential responses to clinical interventions.

A general conceptual limitation of machine learning studies in psychiatry is that the diagnostic labels might themselves be ill defined. Amidst an ever-expanding volume of research data, inconsistencies in neurobiological findings fuel doubts about the validity of the currently defined disease construct of schizophrenia. This might be an issue inherent to psychiatric practice, which contributes to the low reliability of diagnoses under nosologies such as the DSM criteria. The work reported here may indicate a useful step towards more biologically informed diagnoses, as it involves developing algorithms to predict current psychiatric diagnoses based on objective neurobiological features. This approach could also provide a framework for evaluating the validity of clinical diagnoses. Lastly, our empirical results show that multi-parcellation ensemble learning may effectively produce models for early diagnosis of schizophrenia; we anticipate that this approach may also work for other psychoses and for the prediction of treatment responses.

Methods

Subjects

This study examined 92 patients attending the clinical services of the National Institute of Mental Health & Neurosciences (NIMHANS, India) who fulfilled DSM-IV criteria for schizophrenia and had never been treated with any psychotropic medication, including antipsychotics. The diagnosis of schizophrenia was established using the Mini International Neuropsychiatric Interview (MINI) Plus54 and confirmed by another psychiatrist through an independent clinical interview. Details related to illness onset and antipsychotic-naive status were carefully ascertained through reliable information obtained from at least one additional adult relative. The Scale for Assessment of Positive Symptoms (SAPS) and the Scale for Assessment of Negative Symptoms (SANS) were used to measure psychotic symptoms.55 Clinical assessments and MRI were performed on the day before starting antipsychotics.

Controls were recruited from among consenting healthy volunteers from the same locale, matched for age and sex. We used 102 age- and sex-matched healthy volunteers, screened to rule out any psychiatric diagnosis using the MINI as well as a comprehensive mental status examination. For both cases and controls, we recruited only right-handed subjects to avoid the potential confounds of differential handedness. None of the study subjects had contraindications to MRI or medical illnesses that could significantly influence CNS function or structure, such as seizure disorder, cerebral palsy, or a history suggestive of delayed developmental milestones. There was no history suggestive of DSM-IV psychoactive substance dependence or of head injury associated with loss of consciousness longer than 10 min. No subjects had abnormal movements as assessed by the Abnormal Involuntary Movements Scale. Pregnant or postpartum females were not included. The supplementary material provides a table detailing the demographic and clinical profiles of the 174 subjects who qualified for inclusion in the study. (See details on excessive head movement in the ‘Image pre-processing’ section.)

The catchment area for the subject recruitment involved the southern states of India. We obtained informed written consent after providing a complete description of the study to all the subjects. The NIMHANS ethics committee reviewed and approved the original research protocol. The Research Ethics Board at University of Alberta, Edmonton approved the secondary analysis of archived data.

Image acquisition

Magnetic resonance imaging (MRI) was done in a 3.0 Tesla scanner (Magnetom Skyra, Siemens). Resting-state functional MRI: BOLD (blood oxygen level dependent) sensitive echo-planar imaging was obtained using a 32-channel coil for a duration of 5 min 14 s, yielding 153 dynamic scans. The scan parameters were: TR = 2000 ms; TE = 30 ms; flip angle = 78°; slice thickness = 3 mm; slice order: descending; number of slices = 37; gap = 25%; matrix = 64 × 64; FOV = 192 × 192 mm2; voxel size = 3.0 mm isotropic. Subjects were asked to keep their eyes open during the scan. For intra-subject co-registration, structural MRI was performed: T1-weighted three-dimensional high-resolution imaging (TR = 8.1 ms, TE = 3.7 ms, nutation angle = 8°, FOV = 256 mm, slice thickness = 1 mm without inter-slice gap, NEX = 1, matrix = 256 × 256), yielding 165 sagittal slices.

Image pre-processing

We performed pre-processing and feature extraction using MATLAB (The MathWorks, Inc.) toolboxes, including Statistical Parametric Mapping (SPM8, http://www.fil.ion.ucl.ac.uk/spm) and the Data Processing Assistant for Resting-State fMRI (DPARSF),56 as well as Python toolboxes, including the nilearn package57 based on scikit-learn, a Python machine learning library.58 We checked acquired images visually for artefacts such as incomplete brain coverage or ghosting, then re-oriented the origin to the anterior commissure in the structural and functional images. The first ten volumes of each functional time series were discarded to allow the scanner field to reach steady magnetization and the participants to adapt to the scanning noise. Images were then pre-processed with slice-timing correction, image realignment to correct for motion, and intensity normalization. Since head movement may lead to group-related differences,59,60,61 we excluded images of 11 patients and 9 controls from the study based on excessive head movement (translation > 2.0 mm and/or rotation > 2°).62 This yielded a total of 174 subjects: 93 controls and 81 patients. Functional images were co-registered with the structural image and then normalized to MNI space, resampled to 3 × 3 × 3 mm3. Nuisance regression was performed to remove noise induced by head motion, using 24 regressors derived from the parameters estimated during motion realignment; scanner drift, using a linear term; and global fMRI signals from white matter and cerebrospinal fluid segments, using SPM’s new segment method.63 Normalized images were smoothed, detrended and band-pass filtered as appropriate, depending on the feature to be extracted (see details below).
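The denoising steps above were carried out in SPM/DPARSF; for readers working in Python, an analogous confound regression with detrending and band-pass filtering can be sketched with nilearn's signal-cleaning utility (the array shapes and confound set below are illustrative assumptions, not the study's actual pipeline):

  # Confound regression + detrend + band-pass, analogous to the steps above.
  import numpy as np
  from nilearn.signal import clean

  rng = np.random.default_rng(0)
  signals = rng.standard_normal((143, 200))   # 143 volumes x 200 voxels/ROIs
  confounds = rng.standard_normal((143, 26))  # e.g., 24 motion regressors + WM + CSF

  cleaned = clean(signals,
                  confounds=confounds,
                  detrend=True,               # removes linear drift
                  standardize='zscore',
                  low_pass=0.08, high_pass=0.01,
                  t_r=2.0)                    # TR from the acquisition protocol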

Feature extraction

To obtain neurobiologically relevant features, we projected each subject’s resting-state brain data onto 14 different parcellations, each based on a specific a priori defined atlas or set of regions of interest (ROIs). Our goal here was to jointly learn from this entire set of neuroimaging features, extracted through several brain parcellation schemes, to obtain an accurate model; n.b., we are neither trying to compare nor evaluate the influence of any single feature type or ROI definition on prediction accuracy. Our goal is to produce a predictive model whose validation criterion is its predictive accuracy alone.

We used the following 14 pre-defined brain parcellation schemes:

  • yeo: intrinsic functional connectivity of cerebral cortex25

  • smith20, smith70: functional networks during activation and rest (at two different resolutions)26

  • harvard_cort_25, harvard_sub_25: Harvard-Oxford cortical and subcortical parcellation (http://www.cma.mgh.harvard.edu/fsl_atlas.html)

  • msdl: multi-subject dictionary learning for functional parcellation64

  • aal: macroscopic anatomical parcellation of single-subject brain65

  • basc_multiscale_122, basc_multiscale_197, basc_multiscale_325 and basc_multiscale_444: multi-level bootstrap analysis of stable clusters in resting-state fMRI, at four different resolutions66

  • destrieux: sulcal depth-based anatomical parcellation of the cerebral cortex67

  • dosenbach: multivariate pattern analysis of functional connectivity28

  • power: graph measures of functional brain organization68

For each of these 14 parcellation schemes, we extracted 3 regional and 3 connectivity-based resting-state fMRI features. For the regional features, we used:

  • ALFF: amplitude of low-frequency fluctuations

  • fALFF: fractional ALFF

  • ReHo: regional homogeneity

We smoothed each functional image using a 4 mm FWHM Gaussian kernel (except for the extraction of ReHo, to avoid overestimation of spatial homogeneity) and band-pass filtered fMRI time courses at 0.01–0.08 Hz to capture the slow fluctuations that are believed to reflect spontaneous brain activity.69,70 ALFF was calculated as the total power within the frequency range of 0.01 to 0.08 Hz, to estimate the strength of low-frequency oscillations.71 fALFF was calculated as the power within the low-frequency range (0.01–0.08 Hz) divided by the total power in the entire detectable frequency range.69 Lastly, ReHo was calculated using Kendall’s coefficient of concordance,72 as a measure of the similarity between the time series of a given voxel and those of its nearest neighbours.73
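Following the definitions above, ALFF and fALFF for a single voxel can be sketched directly from the power spectrum (the time series below is simulated; DPARSF's actual implementation may differ in detail):

  # ALFF: total power in 0.01-0.08 Hz; fALFF: that power over the whole spectrum.
  import numpy as np

  tr = 2.0                                        # repetition time (s)
  ts = np.random.default_rng(0).standard_normal(143)  # one voxel, 143 volumes

  freqs = np.fft.rfftfreq(len(ts), d=tr)
  power = np.abs(np.fft.rfft(ts)) ** 2

  low = (freqs >= 0.01) & (freqs <= 0.08)
  alff = power[low].sum()                         # strength of slow oscillations
  falff = alff / power[freqs > 0].sum()           # fraction of total power (DC excluded)
  # (ReHo is computed differently: Kendall's W over a voxel's neighbourhood.)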

We calculated each of these features at the voxel level using the DPARSF toolbox, standardized them, and then averaged them over each ROI. For each ROI, we ran a nuisance regression across the features to remove the effects of confounding variables that are generally recommended and commonly reported in neuroimaging research: age, sex, and total intracranial volume.74 In addition, we also used average framewise displacement to (at least partially) counter systematic yet spurious correlations in functional connectivity that may arise from subject motion.59
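The per-ROI confound regression can be sketched as an ordinary linear residualization (the confound values below are simulated placeholders, not the study's data):

  # Regress age, sex, intracranial volume and mean framewise displacement
  # out of one ROI-level feature, keeping the residuals.
  import numpy as np
  from sklearn.linear_model import LinearRegression

  rng = np.random.default_rng(0)
  n = 174
  feature = rng.standard_normal(n)                # e.g., mean ALFF in one ROI
  confounds = np.column_stack([
      rng.uniform(18, 50, n),                     # age
      rng.integers(0, 2, n),                      # sex
      rng.normal(1400, 120, n),                   # total intracranial volume
      rng.uniform(0.05, 0.3, n),                  # mean framewise displacement
  ])

  model = LinearRegression().fit(confounds, feature)
  residual_feature = feature - model.predict(confounds)  # adjusted feature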

We also computed connectivity features with each of the 14 parcellations, by extracting the average time series per ROI and then estimating functional connectivity matrices between each pair of regions using one of three statistical measures (a minimal sketch follows the list below):

  • Pearson correlation

  • partial correlation

  • precision

In each case, the feature vectors were the flattened lower triangular part of these symmetric matrices.
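All three connectivity variants, including the flattening of the lower triangle, are available through nilearn's ConnectivityMeasure; the sketch below uses simulated per-ROI time series (shapes are illustrative):

  # Correlation, partial correlation and precision features per subject.
  import numpy as np
  from nilearn.connectome import ConnectivityMeasure

  rng = np.random.default_rng(0)
  time_series = [rng.standard_normal((143, 116)) for _ in range(10)]  # 10 subjects

  for kind in ('correlation', 'partial correlation', 'precision'):
      # vectorize=True returns the flattened lower triangle of each matrix
      conn = ConnectivityMeasure(kind=kind, vectorize=True, discard_diagonal=True)
      features = conn.fit_transform(time_series)
      print(kind, features.shape)                 # (10, 116*115/2) = (10, 6670)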

We chose to study the above features because earlier literature has established their relevance to schizophrenia pathology. Abnormalities in low-frequency oscillations70,75 and in the regional homogeneity of blood-oxygen-level-dependent signals76,77 have been well documented in schizophrenia. Further, patients diagnosed with schizophrenia have exhibited changes in functional brain connectivity, as revealed through distant correlations.77,78 In addition to simple Pearson correlation, we described the connectivity structure using partial correlation, which measures the interaction between two ROIs while controlling for the signals of all other ROIs. We also used a sparse precision matrix—i.e., the sparse inverse of the covariance matrix—which reveals the brain regions that appear conditionally independent given all other brain regions.79

So, in total, our approach, ‘Ensemble algorithm with Multiple Parcellations for Schizophrenia prediction’, abbreviated EMPaSchiz (read as ‘Emphasis’), modelled 84 sources of data (14 parcellation schemes × (3 + 3) feature types) per subject; these descriptions ranged in size from 17 to 98,346 values. We used appropriate masker classes57 to summarize brain signals from non-overlapping clusters (e.g., basc_multiscale), overlapping networks (e.g., smith), or spheres centred at seeds with a fixed small radius (e.g., power); a sketch follows below. Table 3 presents the total number of features per data source. (The supplementary material presents visualizations of a few representative parcellations, overlaid on an MRI slice.)
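The three kinds of atlas call for different nilearn masker classes; the sketch below shows one of each (the file paths and seed coordinates are illustrative placeholders, and the module path assumes a recent nilearn release):

  # One masker per atlas type; each fit_transform(fmri_img) returns an
  # (n_volumes x n_regions) array of summarized brain signals.
  from nilearn.maskers import (NiftiLabelsMasker, NiftiMapsMasker,
                               NiftiSpheresMasker)

  # Non-overlapping clusters (e.g., basc_multiscale): one integer label per voxel
  labels_masker = NiftiLabelsMasker(labels_img='basc_scale122.nii.gz',
                                    standardize=True)

  # Overlapping networks (e.g., smith): one weighted spatial map per component
  maps_masker = NiftiMapsMasker(maps_img='smith_rsn20.nii.gz', standardize=True)

  # Spheres of fixed small radius around seed coordinates (e.g., power)
  spheres_masker = NiftiSpheresMasker(seeds=[(0, -52, 18), (46, -68, 32)],
                                      radius=5.0, standardize=True)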

Table 3 Number of features extracted for regional and connectivity feature-types from each parcellation scheme

Prediction and evaluation framework

EMPaSchiz produced a classifier from our multi-source data in two levels. At the first level, EMPaSchiz trained 84 different L2-regularized logistic regression classifiers, using the ‘liblinear’ solver80 – one for each individual data source – to predict the diagnosis; we consider each of these a single-source model (SSM). At the second level, EMPaSchiz trained a single L2-regularized logistic regression model that takes the prediction probabilities computed by each SSM and predicts the schizophrenia-vs-normal label; hence, this is a multi-source model (MSM). Figures 7 and 8 show schematic representations of our prediction and evaluation framework. These computations were performed using the scikit-learn package58 and the mlxtend extensions.81
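The two-level scheme can be condensed into the following sketch (data are simulated and the per-source dimensionality is arbitrary; this illustrates the stacking idea, not the full pipeline with PCA and internal CV):

  # Level 1: one L2-regularized LR per data source (SSM).
  # Level 2: an LR stacked on the SSMs' predicted probabilities (MSM).
  import numpy as np
  from sklearn.linear_model import LogisticRegression

  rng = np.random.default_rng(0)
  n, n_sources = 157, 84
  X_sources = [rng.standard_normal((n, 30)) for _ in range(n_sources)]
  y = rng.integers(0, 2, size=n)               # 0 = control, 1 = schizophrenia

  ssms = [LogisticRegression(penalty='l2', solver='liblinear').fit(X, y)
          for X in X_sources]

  # Stack the 84 predicted probabilities into one feature vector per subject
  P = np.column_stack([m.predict_proba(X)[:, 1] for m, X in zip(ssms, X_sources)])
  msm = LogisticRegression(penalty='l2', solver='liblinear').fit(P, y)

  label = msm.predict(P[:1])                   # final schizophrenia-vs-control call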

Figure 7a shows the learned EMPaSchiz-Performance model. Given the resting-state fMRI time series for a subject, EMPaSchiz-Performance first extracts 6 different feature types (F1 to F6; coded here with different fill colours) over each of 14 brain parcellation schemes (P1 to P14; coded here with border colours) to obtain 84 feature sets (FS_{1,1} to FS_{6,14}). Each is given to a “single-source model” (SSM), a learned logistic regression (LR) classifier over the PCA projection of that data, with learned parameters θ_{i,j} (i.e., θ_{1,1} to θ_{6,14}, each corresponding to a specific feature set), trained to predict schizophrenia. This produces a vector of 84 prediction probability values (P_{1,1} to P_{6,14})—one from each LR—which is given to a final trained LR classifier with learned parameters θ_{*,*}. The final prediction probability P_{*,*} is used to predict whether the given subject is “schizophrenia” or “normal”. We also considered 6 other multi-source models, with learned parameters θ_{1,*} to θ_{6,*}—one for each feature type.

Fig. 7 Schematic representation of the performance of the learned EMPaSchiz model (a) and of the process of learning that model (b, c)—see main text for explanation

Figure 7b, c shows the process for learning the EMPaSchiz-Performance model. The EMPaSchiz-Learner first learns 84 different single-source models SSM_{i,j}: for the i-th feature type (i = 1..6) and the j-th parcellation (j = 1..14), EMPaSchiz-Learner computes the (i,j)-feature set from the resting-state fMRI time series of each of the K labelled subjects in the training set, to obtain the feature sets FS*_{i,j} = { FS^k_{i,j} : k = 1..K }. It then trains a regularized logistic regression (LR) model θ_{i,j} to predict schizophrenia from each feature set FS*_{i,j}, where the regularization strength C is obtained using internal CV. For example, θ_{3,12} is learned by fitting LR on FS*_{3,12} (which corresponds to the 3rd feature type, ReHo, with the 12th parcellation, destrieux). After learning all 84 SSM parameters {θ_{i,j}} in this manner, EMPaSchiz-Learner, as shown in Fig. 7c, runs each of the 84 resulting SSMs on each of the K training instances; this produces a new training set P = { P^k_{i,j} }, where P^k_{i,j} is the probability produced by running the (i,j)-th SSM predictor, with learned parameters θ_{i,j}, on the k-th instance. It then learns the multi-source model (MSM) by training regularized LR on the set P to predict schizophrenia; this produces the parameters θ_{*,*}. Similarly, the six other MSMs, θ_{1,*} to θ_{6,*}, are learned by training LR on the sets P*_{1} = { P^k_{1,j} : k = 1..K, j = 1..14 } through P*_{6} = { P^k_{6,j} : k = 1..K, j = 1..14 }.

In more detail: EMPaSchiz first used singular value decomposition of each data source to project it to a lower-dimensional space. We extracted principal components from the training instances, then projected each instance onto the eigenvectors (PCA). We used all the components—i.e., we set the number of principal components to the smaller of the number of original features and the number of instances. Note that these components captured all the variance but, for most data sources, reduced the dimensionality by a huge factor, as the final number of features per data source was at most the number of instances in the training set (~157 subjects in our 10-fold cross-validation). For the few data sources that had fewer features than training instances (e.g., yeo-regional has 17 features), this transformation did not change the number of features, but changed the data to a new basis. The motivation for this procedure was to have a uniform pipeline of PCA transformations for all data sources, irrespective of their varying numbers of features.
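This per-source PCA step can be sketched as follows (shapes are illustrative):

  # Keep all principal components, capped at min(n_features, n_samples),
  # so every data source ends up with at most ~157 dimensions.
  import numpy as np
  from sklearn.decomposition import PCA

  rng = np.random.default_rng(0)
  X_train = rng.standard_normal((157, 6670))   # one connectivity data source

  pca = PCA(n_components=min(X_train.shape))   # = min(157, 6670) = 157
  Z_train = pca.fit_transform(X_train)         # 157 x 157; all variance kept
  # Test instances are projected onto the same training eigenvectors:
  Z_test = pca.transform(rng.standard_normal((17, 6670)))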

Fig. 8 Schematic representation of the evaluation of the model with cross-validation. We use 5 × 10-fold cross-validation to evaluate each learner (the original EMPaSchiz-Learner and its six variants). Here, we first divide the entire dataset of 174 subjects into 10 sets; we then use 9/10 of them to train the EMPaSchiz model (see Fig. 7b, c) and run that model on the remaining 1/10 of the data (see Fig. 7a). We then compute accuracy as the number of correctly labelled instances over all 10 folds, and use this as an estimate of the score of the learned EMPaSchiz-Performance system. We run this entire process five times, over five different partitionings, and compute the overall accuracy of predictions over these 50 train-test splits. Trained models are depicted with green lines and predictions with red lines

For each SSM, EMPaSchiz set the C parameter (the inverse of the regularization strength) by internal 10-fold cross-validation on the training split (5 shuffled iterations). We call the MSM that combined predictions from all 14 × 6 = 84 SSMs ‘stacked-multi’. We also considered six other versions of the MSM, each combining the SSMs for a specific feature type (14 each): stacked-ALFF, stacked-fALFF, stacked-ReHo, stacked-FC-correlation, stacked-FC-partial correlation and stacked-FC-precision.
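Tuning C by internal cross-validation on the training split can be sketched with scikit-learn's grid search (the C grid shown is an assumption; the paper does not list its candidate values):

  # Internal 10-fold CV, repeated over 5 shuffles, to pick C for one SSM.
  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

  rng = np.random.default_rng(0)
  X_train = rng.standard_normal((157, 157))    # PCA-reduced training features
  y_train = rng.integers(0, 2, size=157)

  inner_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
  search = GridSearchCV(LogisticRegression(penalty='l2', solver='liblinear'),
                        param_grid={'C': np.logspace(-3, 3, 7)},
                        cv=inner_cv, scoring='accuracy')
  search.fit(X_train, y_train)
  best_ssm = search.best_estimator_            # refit on the full training split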

The EMPaSchiz model was evaluated in five shuffled iterations of a 10-fold balanced cross-validation approach (90% training set, 10% test set, for a total of 50 train-test splits). We evaluated the model’s generalization performance on the test set (in the outer cross-validation), computing:

  • accuracy (Overall, how often is the classifier correct?)

  • sensitivity (When the actual label is ‘patient’, how often is the prediction correct?)

  • specificity (When the actual label is ‘control’, how often is the prediction correct?)

  • precision (When the predicted label is ‘patient’, how often is the prediction correct?)

For each variant, we report the means and standard errors of these metrics over all 50 train-test splits. To compare MSM models, we used parametric statistical tests (two-sided t-tests) on the accuracy, using the SciPy package.82
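The four reported metrics follow directly from the confusion matrix; a minimal sketch (with toy labels) is:

  # accuracy, sensitivity, specificity and precision from a confusion matrix
  # (labels: 1 = patient, 0 = control).
  import numpy as np
  from sklearn.metrics import confusion_matrix

  y_true = np.array([1, 1, 0, 0, 1, 0, 0, 1])
  y_pred = np.array([1, 0, 0, 0, 1, 0, 1, 1])

  tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
  accuracy    = (tp + tn) / (tp + tn + fp + fn)
  sensitivity = tp / (tp + fn)                 # correct among actual patients
  specificity = tn / (tn + fp)                 # correct among actual controls
  precision   = tp / (tp + fp)                 # correct among predicted patients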

We also performed two additional analyses. First, we explored the effect, on model performance, of feature selection for the SSMs, using the top-r percentage of the total set of features based on univariate testing (F-value score). For example, when r = 20%, the EMPaSchiz-Learner would use only 20% of the original features in each of its 84 SSMs. Note that this is instead of using PCA. (So, for the regional features of the ‘aal’ parcellation, instead of using all 116 features, it only considered the top 0.2 × 116 ≈ 23 features, etc.) While computing the cross-validation scores, we ran the feature selection process ‘in fold’, using the ‘pipeline’ class of scikit-learn58 to avoid obtaining optimistically biased estimates; a sketch follows below. Second, we used our ensemble prediction framework to distinguish the least symptomatic schizophrenia patients from the most symptomatic ones (based on SAPS and SANS), evaluated using leave-one-out cross-validation.
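Running the selection ‘in fold’ amounts to placing it inside a scikit-learn Pipeline, so the univariate scores are recomputed on each training fold only; a sketch under assumed shapes and an assumed percentile:

  # Feature selection nested inside cross-validation to avoid optimistic bias.
  import numpy as np
  from sklearn.feature_selection import SelectPercentile, f_classif
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score
  from sklearn.pipeline import Pipeline

  rng = np.random.default_rng(0)
  X = rng.standard_normal((174, 6670))
  y = rng.integers(0, 2, size=174)

  pipe = Pipeline([
      ('select', SelectPercentile(f_classif, percentile=20)),  # fit on train folds
      ('clf', LogisticRegression(penalty='l2', solver='liblinear')),
  ])
  scores = cross_val_score(pipe, X, y, cv=10)  # selection re-fit inside every fold
  print(scores.mean())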