Multi-biological classification for the diagnosis of schizophrenia using multi-classifier, multi-feature selection and multi-cross validation: An integrated machine learning framework study

Finding effective and objective biomarkers to inform the diagnosis of schizophrenia is of great importance yet remains challenging. Moreover, relatively little work has used multi-biological data for the diagnosis of schizophrenia. This was a cross-sectional study in which we extracted multiple features from three types of biological data: gut microbiota data, blood data, and electroencephalogram data. An integrated machine learning framework, consisting of five classifiers, three feature selection algorithms, and four cross-validation methods, was then used to discriminate patients with schizophrenia from healthy controls. Our results showed that the classifier using multi-biological data performed better than the classifiers using a single type of biological data, with 91.7% accuracy and 96.5% AUC. The most discriminative features (top 5%) for the classification include gut microbiota features (Lactobacillus, Haemophilus, and Prevotella), blood features (superoxide dismutase, monocyte-lymphocyte ratio, and neutrophil), and electroencephalogram features (nodal local efficiency, nodal efficiency, and nodal shortest path length in the temporal and frontal-parietal areas). The proposed integrated framework may help in understanding the pathophysiology of schizophrenia and in developing biomarkers for schizophrenia using multi-biological data.


Introduction
Finding effective and objective biomarkers to inform the diagnosis of schizophrenia (SZ) is of great importance yet remains challenging [1 2]. It is now broadly accepted that gut microbiota abundance, inflammation, immunological factors, and functional brain networks are altered in SZ [3][4][5][6]. However, most of these alterations are observed at the group level, with great variability among individuals with the same phenotypic diagnosis. Consequently, none has so far proven able to reliably aid the differential diagnosis of SZ [1 7]. It is therefore important to analyze how gut microbiota abundance, inflammatory factors, immunological factors and functional brain networks behave at an individual level; this information could be used to better understand the pathology and to identify objective biomarkers for the clinical diagnosis of SZ [8].
Recently, pattern recognition based on machine learning has gained increasing attention. It is well suited to identifying subtle patterns of information in data and is therefore useful for predicting diagnosis at an individual level [1 9-11]. Using a variety of biological data, such as gut microbiota data, blood data, and electroencephalogram (EEG) data, along with machine learning techniques, hundreds of studies have attempted to achieve accurate classification of patients with SZ [8 12-14]. For instance, one study [15] applied the 1-norm support vector machine (SVM) method to 64-channel EEG signals recorded during a working memory task to classify patients with SZ versus healthy controls, and achieved an accuracy of 87%. Another study [16] used Boruta variable selection to select the most discriminatory taxa and random forests to develop a classifier that predicts SZ from the important microbiota features. The receiver operating characteristic curve analysis demonstrated that 12 significant microbiota biomarkers could be used as diagnostic factors. A recent study [8] used machine learning to develop a probabilistic multi-domain data integration model, consisting of immune and inflammatory biomarkers in peripheral blood together with cognitive biomarkers, to discriminate patients with SZ from healthy controls (HCs).
Despite these advances, previous discriminative studies of SZ have primarily focused on biomarkers extracted from one kind of biological data, which only capture partial information about the human body and therefore influence the resulting classification performance.
Increasing evidence has shown that the combination of multimodal imaging data might further improve classification performance [17][18][19][20]. Given the complexity of psychiatric disorders, these findings have led to the hypothesis that combining multi-biological data will also improve our ability to understand SZ, with promise for biomarker identification. However, this hypothesis lacks direct evidence, since only a few studies have used a combination of multi-biological data for a discriminative study of SZ.
In this study, an integrated framework of machine learning consisting of multi-biological data, multi-biological features, multi-classifiers, multi-feature selection algorithms and multi-cross validation methods (5M), was used to discriminate SZ patients from HCs. We hypothesized that combining multi-biological data may improve the classification performance compared to previously used techniques.

Participants
The final sample comprised 99 participants, including 49 SZ patients and 50 HCs. The SZ patients were recruited from the Affiliated Brain Hospital of Guangzhou Medical University, Guangzhou, and met the diagnostic criteria of the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition, Text Revision (DSM-IV-TR). The psychopathology and symptom severity of the patients were evaluated with the positive and negative syndrome scale (PANSS). Psychiatric symptoms had been stable for >2 weeks, with a rate of change on the PANSS of ≤20% over 2 weeks, and the total PANSS score was ≥30. SZ patients were excluded if they met any of the following criteria: (1) any other psychiatric Axis I disorder meeting the DSM-IV criteria; (2) constipation, diarrhea, diabetes, hypertension, heart disease, thyroid diseases or any somatic diseases; (3) a history of epilepsy, with the exception of febrile convulsions; (4) a history of having received electroconvulsive therapy in the past 6 months; (5) lactating, pregnant, or planning to become pregnant; (6) alcohol dependence; or (7) noncompliant drug administration or a lack of legal guardians.
The HCs were recruited from the local community through advertisements and were screened for their family clinical history and a history of mental illness. No healthy subject had a history of brain disease (such as pain, schizophrenia, concussion, brain trauma, etc.), ocular disease, psychotropic medication use or drug abuse. In addition, the subjects were asked not to consume alcohol, tea, coffee or any other food or drugs that might excite the central nervous system within 48 hours before the experiment, and to get enough sleep the night before.
The study protocol was approved by the ethics committees of the Affiliated Brain Hospital of Guangzhou Medical University, and written informed consent was obtained from each subject or their legal guardian prior to the study.

Multi-biological Data Acquisition and Preprocessing

EEG Recording and Preprocessing
Three minutes of resting EEG with eyes closed was recorded from 16 scalp electrodes (i.e., Fp1,

Fecal Sample Collection and Preprocessing
Fresh fecal samples were collected from all subjects and then were stored at -80℃ until DNA extraction. A total of 200 mg of each fecal sample was used for DNA extraction.
The DNA extraction method was consistent with our previously published report [4]. Sequencing of the V4 region of the 16S rRNA gene was performed on the Illumina MiSeq platform. The raw sequences were processed using QIIME2 (Version 2018.6). We then used a pretrained Naïve Bayes classifier, trained on the Greengenes database (Version 13.8), for taxonomic analysis. The raw sequence data reported in this article have been deposited in GenBank at the National Center for Biotechnology Information (NCBI) under accession numbers MT545156-MT547172, which are publicly accessible at https://www.ncbi.nlm.nih.gov.

Blood Collection and Preprocessing
Three milliliters of blood was drawn from control subjects and patients by simple venipuncture between 7:00 and 9:00 a.m., after overnight fasting and tobacco abstinence for more than 12 h.
Blood biochemical indicators were detected by an automatic biochemical analyzer.

Multi-biological Features Extraction

EEG Features Extraction
In a brain functional network based on EEG signals, each channel can be considered a node, and the correlation between two nodes can be regarded as the connecting edge. In this study, we used the phase-locking value (PLV) method to quantify the functional connectivity (FC) between any two channels of EEG signals, as shown in Figure 1.
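The instantaneous phase $\phi(t)$ can be calculated from the signal $x(t)$ by using the Hilbert transform:

$$\tilde{x}(t) = \frac{1}{\pi}\,\mathrm{P.V.}\int_{-\infty}^{\infty}\frac{x(\tau)}{t-\tau}\,d\tau$$

The phase is computed by the following expression:

$$\phi(t) = \arctan\frac{\tilde{x}(t)}{x(t)}$$

Phase synchronization is defined as the locking of the phases of two oscillators:

$$\phi_x(t) - \phi_y(t) = \mathrm{const}$$

The phase-locking value (PLV) is defined as:

$$\mathrm{PLV} = \left|\frac{1}{N}\sum_{j=0}^{N-1} e^{\,i\left(\phi_x(j\Delta t) - \phi_y(j\Delta t)\right)}\right|$$

where $i$ denotes the imaginary unit, $N$ is the total number of samples, and $\Delta t$ is the time between successive samples, which are indexed by $j$.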
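As a rough illustration of the PLV definition, the following minimal sketch averages the unit phasors of the phase difference between two channels and takes the magnitude of the result. It assumes the instantaneous phases have already been extracted (for example with a Hilbert transform, which is not shown here); the function name `phase_locking_value` and the toy signals are hypothetical and are not taken from the study's pipeline.

```python
import cmath
import math


def phase_locking_value(phase_x, phase_y):
    """Return the PLV of two equal-length sequences of instantaneous phases (radians)."""
    if len(phase_x) != len(phase_y) or not phase_x:
        raise ValueError("phase sequences must be non-empty and of equal length")
    # Average the unit phasors of the phase difference, then take the magnitude.
    total = sum(cmath.exp(1j * (px - py)) for px, py in zip(phase_x, phase_y))
    return abs(total) / len(phase_x)


# Toy phases (made-up values): a 10 Hz rhythm sampled at 250 Hz.
n = 1000
phase_a = [2 * math.pi * 10 * j / 250 for j in range(n)]

# Same rhythm with a fixed lag: the phase difference is constant.
phase_b = [p - math.pi / 4 for p in phase_a]

# Phase difference that sweeps one full circle: no consistent locking.
phase_c = [p + 2 * math.pi * j / n for j, p in enumerate(phase_a)]

plv_locked = phase_locking_value(phase_a, phase_b)
plv_unlocked = phase_locking_value(phase_a, phase_c)
```

A constant phase difference yields a PLV at the upper end of the range (perfect locking), whereas a phase difference spread evenly around the circle yields a PLV near zero.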
In this study, the network threshold was determined by the method of network sparsity.
According to the E-R random graph model, the sparsity of a fully connected network is not less than 2 ln N / N, where N is the number of nodes. In this study, the number of network nodes was 16, so the minimum connection sparsity was 34%. To ensure that the constructed network had small-world properties, the maximum network sparsity was 73%. This study analyzed networks with a sparsity of 34% to 73%, constructing a separate network at each sparsity level within this range with a step size of 1%. To obtain a single characteristic value for each attribute, we calculated the area under the curve of that attribute between the lower and upper thresholds.
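The sparsity range and the area-under-the-curve summary can be sketched as follows. This is only an illustration: the trapezoidal rule is an assumed integration method (the text does not say which one was used), the helper names are hypothetical, and the metric curve is made-up data standing in for a real network attribute.

```python
import math


def min_sparsity(n_nodes):
    """Lower sparsity bound 2*ln(N)/N quoted in the text."""
    return 2 * math.log(n_nodes) / n_nodes


def area_under_curve(xs, ys):
    """Trapezoidal area of a metric ys sampled at sparsity levels xs."""
    return sum(
        (xs[k + 1] - xs[k]) * (ys[k] + ys[k + 1]) / 2
        for k in range(len(xs) - 1)
    )


# For 16 channels the bound is about 0.347; the text reports it as 34%.
lower_bound = min_sparsity(16)

# Sparsity levels 0.34, 0.35, ..., 0.73 (1% steps, 40 networks per subject).
sparsities = [s / 100 for s in range(34, 74)]

# Hypothetical metric curve, e.g. a global efficiency that grows with sparsity.
metric = [0.5 + 0.4 * s for s in sparsities]

# One scalar feature per attribute: the area under its sparsity curve.
auc_feature = area_under_curve(sparsities, metric)
```

Summarizing each attribute by its area over the whole range avoids committing to one arbitrary threshold, which is presumably why the prefix "a" appears in feature names such as aCp and aLp.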
Global attributes, including the global clustering coefficient (aCp), shortest path length (aLp), global efficiency (aEg) and local efficiency (aEloc), and nodal attributes, including the nodal clustering coefficient (aNCp), nodal shortest path length (aNLp), nodal efficiency (aNe), nodal local efficiency (aNLe), and degree centrality (aDc), were computed as EEG features, following Rubinov [21]. In this study, 64 global attributes and 640 nodal attributes were initially computed for analysis. Any feature that was missing for any participant was removed. This filtering left 48 global attributes and 526 nodal attributes for the final analysis. The selected set therefore contained no missing values, and no imputation was required.

Gut Microbiota Features Extraction
Through gene sequencing technology, 171 species of microbiota markers were obtained from all subjects. Among them, any microbiota marker that was missing in more than 85% of the participants was removed. Ninety-four microbiota markers were removed, and 77 gut microbiota markers were selected for the final analysis.

Blood Features Extraction
The numbers of white blood cells (WBC), neutrophils (NEU), lymphocytes (LYM), platelets (PLT) and monocytes (MON) were recorded from complete blood counts after routine blood tests. Four indicators of blood inflammation and immunity, namely the neutrophil-lymphocyte ratio (NLR), platelet-lymphocyte ratio (PLR), monocyte-lymphocyte ratio (MLR) and systemic immune inflammation index (SIII), were calculated from these five cell counts. Moreover, the oxidative stress indicators, including superoxide dismutase (SOD), homocysteine and C-reactive protein (CRP), were measured in the collected serum. In total, we collected 12 blood features for the final analysis.
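The four derived indices can be computed directly from the cell counts, as in the sketch below. The text does not spell out the formulas, so the commonly used definitions are assumed here, in particular SIII = platelets × neutrophils / lymphocytes. The function name and the example counts are hypothetical.

```python
def inflammation_indices(neu, lym, plt, mon):
    """Derive ratio-based indices from cell counts expressed in the same units."""
    if lym <= 0:
        raise ValueError("lymphocyte count must be positive")
    return {
        "NLR": neu / lym,          # neutrophil-lymphocyte ratio
        "PLR": plt / lym,          # platelet-lymphocyte ratio
        "MLR": mon / lym,          # monocyte-lymphocyte ratio
        "SIII": plt * neu / lym,   # assumed definition of the systemic immune inflammation index
    }


# Made-up counts in units of 10^9 cells/L, for illustration only.
indices = inflammation_indices(neu=4.0, lym=2.0, plt=250.0, mon=0.5)
```

Because every index shares the lymphocyte count as its denominator, the derived features are correlated with each other and with the raw counts, which is worth keeping in mind when interpreting feature weights.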

Statistical Analysis
Statistical analyses were conducted using SPSS software version 22 (IBM). All statistical tests were two-tailed. The sex distribution of the two groups was compared using the χ² test, and age and years of education were compared using two-sample t-tests. Unless otherwise specified, the significance threshold for all tests was p < 0.05, or FDR-corrected p < 0.05.

Machine Learning
We developed an integrated machine learning framework based on 5M to discriminate SZ patients from HCs (Figure 2). In brief, the framework involved three phases: a data preparation phase, a model training phase, and an independent model testing phase.

Data Preparation Phase
Data preparation included feature extraction and subject grouping. We extracted three types of biological features from fecal data, blood data, and EEG data, namely, gut microbiota features, blood features, and EEG features, respectively. In total, 77 gut microbiota features, 12 blood features, and 574 EEG features were selected for the final analysis. The three types of biological features were used as input features for machine learning, either individually or in combination, forming four input feature sets. At this stage, we randomly split the participants into two groups, a training dataset and an independent testing dataset, at a ratio of 3:1. The training dataset was used to train the model parameters, and the independent testing dataset was used to evaluate the performance of the trained model.
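A 3:1 random split can be sketched as below. Stratifying by diagnosis, so that both groups appear in each subset in similar proportions, is an assumption made here for illustration; the text only states that the split was random. The function name, the seed, and the label coding (1 = SZ, 0 = HC) are likewise hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), drawing the test share from each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# 49 SZ patients (label 1) and 50 HCs (label 0), as in the final sample.
labels = [1] * 49 + [0] * 50
train_idx, test_idx = stratified_split(labels)
```

Whatever splitting rule is used, the key property is that the testing indices are fixed before any feature selection or model fitting, so that no information from the held-out participants leaks into training.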

Model Training Phase
The specific details of the model training phase and independent model testing phase are shown in Figure 3. The model training procedures included three steps: multi-feature selection algorithms, multi-classifier, and multi-cross validation methods.

Multi-cross Validation
To ensure that the sample size available for training was large enough and to prevent overfitting caused by insufficient training, multiple cross-validation methods were performed within the training set, including the 10-fold, 5-fold, 3-fold and leave-one-out methods.
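The four cross-validation schemes differ only in the number of folds, with leave-one-out being the special case in which the number of folds equals the number of training samples. The following minimal sketch makes this explicit; the generator name and the training-set size of 75 are hypothetical, and the indices are assumed to have been shuffled beforehand.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds of near-equal size."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


n_train = 75  # hypothetical training-set size
schemes = {"10-fold": 10, "5-fold": 5, "3-fold": 3, "leave-one-out": n_train}

# Number of train/validation rounds produced by each scheme.
fold_counts = {
    name: sum(1 for _ in k_fold_indices(n_train, k))
    for name, k in schemes.items()
}
```

Fewer folds leave less data for fitting in each round, while leave-one-out maximizes the fitting data at the cost of many more rounds.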
Several combinations of the aforementioned procedures were investigated to optimize the data analysis. The PCA and RFE feature selection algorithms could not be applied to the blood features because of their small dimension. As a result, a total of 280 models were obtained from the four input feature sets, five classifiers, three feature selection algorithms and four cross-validation methods. All model training in the second phase was restricted to the training dataset.
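The total of 280 models can be reproduced by enumerating the combinations, under one assumption that is not stated explicitly in this section: "no feature selection" counts as a fourth option alongside the three algorithms. The names `classifier_4`, `classifier_5` and `selector_3` are placeholders for components that are not named in this section.

```python
from itertools import product

feature_sets = ["gut", "blood", "eeg", "combined"]
# SVM, LR and RF are named in the text; the other two are placeholders.
classifiers = ["SVM", "LR", "RF", "classifier_4", "classifier_5"]
# "none" is an assumed option; selector_3 is a placeholder for the third algorithm.
selectors = ["none", "PCA", "RFE", "selector_3"]
cv_methods = ["10-fold", "5-fold", "3-fold", "leave-one-out"]


def is_valid(feature_set, selector):
    """PCA and RFE are skipped for the low-dimensional blood feature set."""
    return not (feature_set == "blood" and selector in {"PCA", "RFE"})


models = [
    combo
    for combo in product(feature_sets, classifiers, selectors, cv_methods)
    if is_valid(combo[0], combo[2])
]
n_models = len(models)
```

The full grid has 4 × 5 × 4 × 4 = 320 combinations, and removing the 1 × 5 × 2 × 4 = 40 blood-only combinations that would need PCA or RFE leaves 280.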

Independent Model Testing Phase
In the third phase, we used an independent testing dataset to estimate the generalizability of 280 models arising from the second phase. To quantitatively estimate the performances of all the methods mentioned in this study, we utilized the metrics of accuracy, sensitivity and specificity.
Moreover, we plotted the receiver operating characteristic (ROC) curves and then calculated the area under the curve value (AUC) for each classification situation to examine the possibility of correctly discriminating SZ patients and HCs.
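For reference, the four reported metrics can be computed as in the minimal sketch below. The label coding (1 = SZ, 0 = HC), the function names, and the toy scores are assumptions for illustration; the AUC is computed through its rank interpretation, that is, the probability that a randomly chosen patient receives a higher score than a randomly chosen control.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = patient, 0 = control)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def roc_auc(y_true, scores):
    """AUC as the fraction of patient/control pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example with made-up classifier scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy, sensitivity and specificity depend on the decision threshold, whereas the AUC summarizes ranking quality over all thresholds, which is why both kinds of metric are reported.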
A permutation test was applied to evaluate the statistical significance of the classification results.
In our analysis, we randomly permuted the labels of all samples 1000 times, and the p value was computed as the proportion of permutations whose accuracy was no less than the accuracy obtained with the original labels.
The statistical significance was set at p < 0.05. All the automatic classification work was performed on NEURO-LEARN (https://github.com/Raniac/NEURO-LEARN [22] ), which is a solution for collaborative pattern analysis of neuroimaging data.
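The permutation p value described above can be sketched as follows. This is a simplified variant that shuffles the labels against a fixed set of predictions; a full implementation would typically retrain the model on each permuted label set, and the text does not say which variant NEURO-LEARN uses. The function names, the seed, and the toy predictions are hypothetical.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_p_value(y_true, y_pred, n_permutations=1000, seed=0):
    """Proportion of label permutations whose accuracy is at least the observed one."""
    rng = random.Random(seed)
    observed = accuracy(y_true, y_pred)
    shuffled = list(y_true)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if accuracy(shuffled, y_pred) >= observed:
            count += 1
    return count / n_permutations


# Toy test set: 12 patients and 12 controls, with one error in each group.
y_true = [1] * 12 + [0] * 12
y_pred = [1] * 11 + [0] + [0] * 11 + [1]

observed_accuracy = accuracy(y_true, y_pred)
p_value = permutation_p_value(y_true, y_pred)
```

With 1000 permutations the smallest nonzero p value that can be reported is 0.001, so a result of zero is usually stated as p < 0.001.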

Results

Participants
The resulting data set comprised 99 participants, including 49 SZ patients and 50 HCs. See Table 1 for a detailed description of other characteristics.

Classification Results and Analysis
We used an independent testing dataset to estimate the generalizability of the total 280 models.
The classification performance under the 10-fold, 5-fold, 3-fold, and leave-one-out cross-validation methods is reported in eTable 1, eTable 2, eTable 3, and eTable 4 in the Supplementary, respectively. There was no significant difference among the results of the cross-validation methods. Table 2 shows the classification performance of the models obtained using different input features under the 10-fold cross-validation method. The optimal classification performance was achieved when multi-biological features were combined as input features, with 91.7% accuracy, 91.7% sensitivity, 91.7% specificity, and 96.5% AUC. The performance of the classifier based on multi-biological features was better than that of the classifiers using a single type of biological feature (Figure 4). In addition, among the single types of biological feature, the blood features achieved the best classification, with an accuracy of 83.3% and an AUC of 87.5%.
When gut microbiota features, blood features, and EEG features were used as input feature sets alone, the classifiers and feature selection algorithms of the optimal model were inconsistent, which may be due to the heterogeneity of biological data. The SVM, LR and RF classifiers without using any feature selection algorithm had better classification performance when using combined features, with AUCs over 90%.

Discriminative Features
In this subsection, we report the most informative features selected to differentiate the SZ patients from HCs. We discuss the most discriminative features from the optimal model, which was generated when combined features were used. For quantitative analysis, the top 34 commonly selected features (5% of the total number of features) are summarized in Table 3, listed in descending order of their weights. They include 14 gut microbiota features, 8 blood features, and 12 EEG features.

Discussion
To the best of our knowledge, this is the first discriminative study of SZ to combine multi-biological data from gut microbiota, blood, and EEG. We developed an integrated machine learning framework to discriminate SZ patients from HCs. The main finding of this study is that the best performance was achieved when the three types of biological features were combined as input features for classification, with 91.7% accuracy, 91.7% sensitivity, 91.7% specificity, and 96.5% AUC.
In this study, we developed an integrated machine learning framework using a combination of multi-biological data, which is a promising direction for the identification of biomarkers for the diagnosis, prognosis, and treatment of SZ patients. A recent study indicated that the diagnosis of SZ can be predicted with possible clinical utility by a computational machine learning algorithm using the combination of blood and cognitive biomarkers; more importantly, the integration of multi-biological data outperformed a single type of biological data, which is consistent with our findings [8]. Interestingly, early SVM-based prediction of the later development of SZ in a familial high-risk cohort is possible and can be improved by combining schizotypal and neurocognitive features with neuroanatomical variables [23]. In summary, based on the integrated machine learning framework, the combination of multi-biological data substantially improves the classification performance for schizophrenia patients. Our results revealed that the features from multiple biological datasets provided complementary information and can help to develop effective and objective biomarkers for the clinical diagnosis of SZ [1].
To date, although numerous discriminative studies of SZ have used blood-based data [24][25][26] or neuroimaging data [20 27-30], few studies have investigated the potential of gut microbiota data to provide biomarkers for the diagnosis of SZ. In our study, among the top 34 features shown in Table 3, gut microbiota features accounted for a large proportion, indicating that they played an important role in distinguishing patients with SZ from HCs. An increasing amount of evidence suggests that the gut microbiota communicates bidirectionally with the central nervous system through the microbiome-gut-brain axis (MGBA), thereby influencing brain function and behavior [31 32]. Recently, a few studies have focused on the role of the MGBA in SZ and revealed several alterations in the gut microbiota of SZ patients [33][34][35][36]. These reports of altered gut microbiota are consistent with the present study, in which the most informative gut microbiota features include Lactobacillus, Haemophilus, Collinsella, Clostridium, and Prevotella. Furthermore, Yuan et al. [37] have shown that changes in the gut microbiota and its metabolites may cause neuronal damage. Lactobacillus can stimulate TNF production; on this basis, Lactobacillus may induce changes in inflammatory factors that induce SZ [38]. On the other hand, short-chain fatty acids (SCFAs), the primary bacterial metabolites produced, can enter the central nervous system through the blood-brain barrier (BBB) [39]. Clostridium is the main source of propionate in the gut, indicating that Clostridium may influence the BBB and act on the brain by regulating SCFAs. In addition, Collinsella has been shown to produce the proinflammatory cytokine IL-17a and to alter intestinal permeability by promoting the release of neurotransmitters produced by gut microbiota [40], thereby acting on the central nervous system.
Taken together, these investigations suggest that the gut microbiota may affect the central nervous system by acting on several pathways, providing a physiological basis for using the gut microbiota as a biomarker in the classification of the two groups.
Among the blood features we extracted, those that contributed the most to classification included superoxide dismutase (SOD), monocyte-lymphocyte ratio (MLR), monocyte (MON), neutrophil (NEU), C-reactive protein (CRP), white blood cell (WBC) and neutrophil-lymphocyte ratio (NLR), which is consistent with previous studies using conventional univariate statistical analysis. Numerous studies suggest that oxidative stress contributes to the pathogenesis of SZ, and abnormalities in antioxidant enzymes, including SOD activities, are frequently found in patients diagnosed with SZ [41][42][43]. A previous study [44] indicated that SOD activities remained lower in patients with SZ and may be an important indirect biomarker of oxidative stress in SZ. The present findings provide additional evidence of increased oxidative stress in SZ. Blood inflammatory and immune system abnormalities in patients with SZ have been widely reported, and they lead to an increase in various inflammatory markers. NEU count has been shown to be increased in chronic SZ patients [45]. Increased MON count has been reported in chronic SZ patients as well [46 47]. Furthermore, a moderately increased CRP in SZ patients compared to HCs has been observed [48][49][50]. It has been indicated that subjects with SZ have significantly elevated WBC. MLR and NLR have recently been used as indicators of inflammation and as predictors of cardiovascular disease, the leading cause of death in SZ. A recent meta-analysis found a significant increase in NLR in patients with SZ [51]. Elevated MLR and NLR have been observed in SZ, suggesting an increased inflammatory response in SZ [45]. Our experimental results are consistent with these studies.
Table 3 shows that the EEG features with heavy weights derive primarily from the delta and alpha2 frequency bands, and partly from the beta and gamma frequency bands.
Previous investigators found increases in delta and theta waves, decreases in alpha waves and increases in beta and gamma waves in schizophrenic individuals [8 12 29 30 52 53] . Moreover, the most prominent change was in the spectral power of the delta wave, which may support the development of a biological marker for diagnosing patients with SZ [29 30 54] . In addition, among these EEG features, node attributes including nodal local efficiency (aNLe), nodal efficiency (aNe), nodal clustering coefficient (aNCp), and degree centrality (aDc), contributed most to classifying SZ. EEG studies have shown that in the resting state, small world attributes in SZ patients were disrupted, with a lower clustering coefficient and a longer shortest path length [55] .
In addition, global and local efficiency are lower in SZ patients than in healthy people [56].
The most discriminative EEG features in Table 3 are primarily concentrated in the temporal lobe and partly in the frontal lobe. Abnormalities in temporal and frontal lobe function and structure have been widely reported in SZ [57]. The frontal and temporal lobes are primarily associated with higher cognitive functions; the temporal lobe in particular is associated with hearing and language functions, which has been confirmed by MRI studies [58]. These results are in accordance with previous structural and functional neurological findings.

Limitations
The present study has several limitations. First, since this is a cross-sectional study, we cannot infer causality. There is evidence that immuno-inflammatory markers are altered from the beginning of SZ, and it is broadly accepted that inflammation plays a causal role in SZ.
However, from a diagnostic perspective, this is irrelevant. A given marker has only to discriminate between two conditions, regardless of whether it is a cause, consequence, or correlate of the pathophysiological process. Second, there was a significant difference between the two participant groups in terms of education years, although the results remained unchanged when taking it into consideration as a covariate. Third, the sample size was moderate. A larger independent sample is essential to examine the reproducibility of our findings.

Conclusions
In conclusion, we developed an integrated machine learning framework and used the combination of multi-biological data to discriminate patients with SZ from HCs, which substantially improved the classification performance. Our results demonstrate that features from multiple biological data provide complementary information that aids in providing effective and objective biomarkers to inform the clinical diagnosis of schizophrenia. They also indicate that our framework is effective in conveying comprehensive and complementary information for the purpose of classification.