Inflammation-related biomarkers in major psychiatric disorders: a cross-disorder assessment of reproducibility and specificity in 43 meta-analyses

Inflammation is a natural defence response of the immune system against environmental insult, stress and injury, but hyper- and hypo-inflammatory responses can trigger diseases. Accumulating evidence suggests that inflammation is involved in multiple psychiatric disorders. Using inflammation-related factors as biomarkers of psychiatric disorders requires the proof of reproducibility and specificity of the changes in different disorders, which remains to be established. We performed a cross-disorder study by systematically evaluating the meta-analysis results of inflammation-related factors in eight major psychiatric disorders, including schizophrenia (SCZ), bipolar disorder (BD), autism spectrum disorder (ASD), major depression disorder (MDD), post-trauma stress disorder (PTSD), sleeping disorder (SD), obsessive–compulsive disorder (OCD) and suicide. A total of 43 meta-analyses involving 704 publications on 44 inflammation-related factors were included in the study. We calculated the effect size and statistical power for every inflammation-related factor in each disorder. Our analyses showed that well-powered case–control studies provided more consistent results than underpowered studies when one factor was meta-analysed by different researchers. After removing underpowered studies, 30 of the 44 inflammation-related factors showed significant alterations in at least one disorder based on well-powered meta-analyses. Eleven of them changed in patients of more than two disorders when compared with the controls. A few inflammation-related factors showed unique changes in specific disorders (e.g., IL-4 increased in BD, decreased in suicide, but had no change in MDD, ASD, PTSD and SCZ). MDD had the largest number of changes while SD has the least. Clustering analysis showed that closely related disorders share similar patterns of inflammatory changes, as genome-wide genetic studies have found. According to the effect size obtained from the meta-analyses, 13 inflammation-related factors would need <50 cases and 50 controls to achieve 80% power to show significant differences (p < 0.0016) between patients and controls. Changes in different states of MDD, SCZ or BD were also observed in various comparisons. Studies comparing first-episode SCZ to controls may have more reproducible findings than those comparing pre- and post-treatment results. Longitudinal, system-wide studies of inflammation regulation that can differentiate trait- and state-specific changes will be needed to establish valuable biomarkers.


Introduction
Inflammation is a regulated process that occurs in response to injury or stress. Inflammation is also implicated in psychiatric disorders. Triggered by various internal and external stressors, inflammation is moderated by the immune system through a series of signalling pathways, particularly proteins like pro-and antiinflammatory factors. Changes of some inflammationrelated factors (IRFs) have been intensively studied and reviewed in relation to psychiatric disorders [1][2][3][4][5][6][7][8][9] . Furthermore, pharmacology studies suggest that inflammation may have a causal effect in the development of psychiatric disorders. On one hand, drugs that target IRFs have been proposed for treating psychiatric disorders, including major depression (MDD) [10][11][12] , schizophrenia (SCZ) 13 and bipolar disorder (BD) 14 , particularly in a subset of patients with chronic inflammation 10 ; a recent meta-analysis of anti-cytokine treatment showed significant antidepressant effects 10 . On the other hand, some current psychotropic drugs, including antipsychotics and antidepressants, have anti-inflammatory effects, though with inconsistent findings [15][16][17][18][19] .
The relevance of inflammation to psychiatric disorders suggests its potential as a suitable biomarker. A biomarker is defined as "any substance, structure, or process that can be measured in the body," which can "influence or predict the incidence of outcome or disease" 20 as well as the treatment response 21 . Measurable in peripheral blood, IRFs have the advantage of accessibility. Currently, diagnosis of functional psychiatric disorders is based mainly on subjective interview. The addition of trait biomarkers associated with disease diagnoses can only improve the accuracy of diagnosis. Furthermore, state markers associated with specific symptoms or stages of disease can help to monitor treatment. Accordingly, it is crucial to explore whether IRFs can be used as trait and/or state markers of psychiatric disorders. IRF changes need to be associated with diseases, symptoms or treatment effects, and more importantly, such associations must be specific and reproducible to turn IRFs into biomarkers. Although dozens of meta-analyses have been published on IRFs in psychiatric disorders, the reproducibility of those results has not been thoroughly evaluated. Without comparing the studies from different disorders, the specificity of IRFs cannot be determined.
Here, we systematically reviewed meta-analysis studies of IRF changes in major psychiatric disorders. Through the evaluation, we found that meta-analyses with sufficient statistical power produced stable results. After filtering out the underpowered data, we identified IRF changes that were shared across disorders or unique to specific disorders as potential trait markers. We identified 30 IRFs showing alterations in at least one disorder. We also identified IRFs as potential state markers for specific types of patients with MDD, SCZ or BD in various comparisons. We discussed the missing aspects of existing studies and recommended future study designs needed to translate IRFs into biomarkers.

Systematic review of meta-analyses on inflammation-related factors in psychiatric disorders
Hundreds of studies have been published on IRFs in psychiatric disorders in recent decades. Most of these studies have small sample sizes and offer limited statistical power, but statistics from these small studies can be pooled in meta-analysis, boosting power to detect true signals. Meta-analysis detects consistent changes across multiple studies. Biological or technical noises, errors and mistakes in individual studies are likely to be random and less likely affect the meta-analysis results. We chose to analyse meta-analysis publications to avoid much of the noise. We searched meta-analysis studies of IRFs in PubMed, Springer and Web of Science with the following search builder: ((((cytokine) OR ((chemokine) OR (Interleukin) OR (interferon) OR (tumour necrosis factor) OR (colony-stimulating factor) OR (growth factor) OR inflammat%))) AND (#DISORDER#) AND meta-analysis. #DISORDER# is one of the major psychiatric disorders: SCZ, BD, autism spectrum disorder (ASD), MDD, posttraumatic stress disorder (PTSD), anxiety, suicide, sleeping disorder (SD) or obsessive-compulsive disorder (OCD). For growth factors, we selected only those known to regulate inflammation, such as IGF 22 , VEGF 23 , NGF 22 , FGF 24 and TNF 24 . The inclusion criteria were the following: 1) published in English before March 2018; 2) assessed circulating cytokine/chemokine/inflammationrelated factor levels in plasma or serum samples or the central nervous system; 3) meta-analysis only. The studies involving patients of different treatment stages or symptom states (state-related) were analysed separately from the case-control studies (trait-related). We discarded two studies testing for the correlation between the IRF levels with depressive symptoms [25][26][27] since their measurements were not compatible with the other differential studies. The literature selection followed the standard by Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 28 (Supplementary Figure, SF1).
We collected 43 meta-analysis publications on 44 IRFs in eight major psychiatric disorders. Data from 43 metaanalysis studies included SCZ (number of studies, n = 11), MDD (n = 16), BD (n = 6), OCD (n = 1), suicide (n = 2), ASD (n = 1), PTSD (n = 1), SD (n = 1), and cross-disorder studies (n = 4). Anxiety did not meet all our criteria for inclusion. One publication can contain meta-analyses of several IRFs in multiple disorders targeting different comparisons; therefore, we partitioned the data to serve two different analyses: trait marker analysis and state marker analysis (Table 1). Trait marker analysis targets IRF changes in undifferentiated case-control comparisons. State marker analysis targets IRF changes in patients with the same disorder but different clinical states. When a meta-analysis used patients with a clearly Table 1 Use of the meta-analysis publications for trait and state marker analyses   BT-AT before treatment-after treatment comparision,  (Table 1). We summarised the results from the meta-analysis, including disorders and IRFs studied, sample sizes, tissue sources, statistical practice (controls for the batch effect, covariate and multiple testing inflation) and major findings (details in Supplementary Table S1). Most data were generated from peripheral blood without differentiating whole blood, serum and plasma; one study involved three IRFs in the brain 29 , and four meta-analyses were analysed in both blood and the cerebrospinal fluid (CSF) 13,[29][30][31] . The most frequently studied disorders are MDD, BD, suicide and SCZ. These four disorders were meta-analysed numerous times, giving us an opportunity to assess the stability of the metaanalysis results over time.
Evaluation of the stability of findings by consistency in IRFs meta-analysed multiple times To evaluate the reproducibility of published changes of IRFs in psychiatric disorders, we examined the stability of the reported changes in IRFs that were analysed in one of the four frequently studied disorders. Ideally, reproducibility should be measured by using replicates from independent samples. Independence of samples was problematic in this study because meta-analysis studies normally use most if not all available individual studies, including those used by previous meta-analysis studies. Therefore, the different meta-analysis studies of the same IRF-disease combinations cannot be considered independent replicates. We solved this problem by using stability as a proxy of reproducibility. Even though stability is not a true measure of reproducibility, we assume that stable findings are more likely to be reproducible than unstable ones. Through the stability evaluation, we learned that studies with high statistical power (>80%) produced relatively stable findings. Therefore, we used statistical power to filter all meta-analyses and identified inflammatory changes that are likely to be reproducible.
By using the stability of the results to assess the reproducibility for the IRFs changed in BD, SCZ, suicide and MDD, we found that some findings in these metaanalyses were stable while others were not. For example, sIL-2R and sTNF-R1 were significantly elevated consistently in BD [32][33][34] . Some significant IRF changes only appeared in recent studies as more data became available. For example, the difference of IFN-gamma was not significant between MDD and controls in a study of 238 subjects 35 , but a later meta-analysis that includes a total of 1470 subjects found IFN-gamma to be significantly decreased in MDD patients 36 . IL-8 was reported to have no significant changes in MDD by a study of 560 subjects 37 , but it was found to be increased in MDD in a later study of 3788 subjects 38 . These conflicting results suggest that those changes can only be detected in larger sample sizes.
We hypothesised that the inconsistent results of the same IRF within the same disorder may result from underpowered studies. To test this hypothesis, we extracted the effect size measured as a standardised mean difference of every IRF in each disorder from those metaanalyses, and calculated their statistical powers by using G*power 39 targeting p < 0.05. Totally, 26% of the metaanalyses did not have enough power (<80%).
For the IRF meta-analysed multiple times, 42% (16/38) of the tests showed inconsistent results in case-control comparisons. Totally, 67% (24/36) of the tests showed inconsistent results in state-related meta-analyses. After filtering out the tests with insufficient statistical power, only 16% (5/31) of case-control-related tests still showed inconsistent results. Totally, 48% (15/31 total, 4/10 of BD, 5/6 of MDD and 6/15 of SCZ) of the state-related tests remained inconsistent. Filter by statistical power clearly improved the consistency of the results by about 20% or more for both trait and state-related analyses. If two studies were completely independent, as each result has three possible states: increase, decrease and no change, the chance of two results having the same type of state is 11% (1/3 × 1/3). The observed consistency here (>33%) is much higher than 11% even in the underpowered metaanalyses. This high consistency can be partially caused by the relatedness of data used in the two meta-analyses. But the overall pattern of consistency, particularly that improved consistency associated with filter by statistical power is very encouraging. All consistencies are higher than 50% after filtering. Case-control trait-related findings have 84% consistency. The results of SCZ and BD case-control comparisons showed 100% consistency. Trait-related findings are generally more reliable than state-related tests, which have consistencies ranging between 16% and 60%.
We looked into the inclusion and exclusion criteria of all the meta-analyses and noticed large variations: 37% of the meta-analyses required diagnosis with DSM/ICD; 20% of the studies required specific age ranges; 70% of the studies removed data without controls; only 20% of the studies removed duplicated data; 11% of the studies excluded participants with other comorbidities. Very few used identical criteria. The inconsistent use of inclusion/ exclusion criteria could potentially increase the inconsistencies for the results of meta-analyses. But our analyses showed sufficient consistency for those wellpowered data (84% consistency for the results of case-control comparisons). The inconsistent inclusion/ exclusion criteria used by different studies did not destroy the findings.

Trait-related biomarkers
Reproducible findings in meta-analyses from well-powered studies Based on the stability evaluation, we decided to analyse only the data from the well-powered studies because they were more likely to yield reproducible findings. We removed the data with statistical power lower than 0.8. We applied this filter to all data, assuming that all data including those tested only once will have better reproducibility. The single meta-analysis of OCD was filtered out because of the insufficient statistical power. Thirtyeight IRFs survived the filtering. Thirty of the 38 IRFs showed significant alterations in at least one disorder ( Table 2). All well-powered studies of IRFs (power > 0.8) are detailed in Supplementary Table S2.
We also examined the impact of tissue types on the reproducibility of findings. The single study on IRFs from the brain was under-powered 29 . Five IRFs were significantly changed in SCZ patients based on CSF samples. IL-6 was consistent with the changes detected in blood while IL-1beta was inconsistent. Three others (KYN, IL-8 and KYNA) were not studied in blood metaanalyses.

Disease specificity based on cross-disorder comparison
To evaluate the specificity of each IRF in disorders, we investigated whether the change of one IRF occurs universally in all psychiatric disorders or only in some disorders by using only well-powered data. Whenever an IRF only shows a change in one or several disorders but not every disorder, we consider that it merits disease specificity. An IRF change is considered to be non-specific when all the disorders carry the same change.
We used the IRFs that changed for multiple disorders patients compared with controls to assess their degree of specificity. Thirteen of them were analysed in more than half (4) the disorders, including TGF-beta, IL-10, IL-1RA, IL-1beta, IL-2, IL-4, IL-6, IL-8, sIL-2R, sIL-6R, TNFalpha, IFN-gamma and C-reactive protein (CRP). All these IRFs changed in some disorders but had no change or even opposite changes in other disorders. Among them, IL-2 is uniquely significantly low in suicide; IL-4 is significantly low in suicide, but high in BD; sIL-6R is uniquely significantly high in BD; sIL-2R is uniquely not changed in PTSD. IL-6 and CRP are the two most commonly increased IRFs changed in four disorders. The most commonly decreased IRF is the nerve growth factor (NGF) in MDD and SCZ.
The specificity of IRFs across disorders can reflect a similarity in the aetiology or pathology of disorders. Unsupervised clustering analysis on the presence of the changes of the 30 IRFs in the seven disorders showed an interesting pattern-BD and SCZ were closer to each other than the others, while suicide clustered with PTSD and SD (Fig. 1). This pattern matched the clinical similarity among these disorders and their genetic similarity, based on published genetic studies 40 .

State-related biomarkers
IRFs that change at different states of disorders were also examined in the well-powered studies. Two types of studies have been used in state-related analyses, including ten studies that compared BD, SCZ and MDD patients at different clinical states with controls, and twelve studies that compared pre-and post-treatment patients. Quantitative differences in IRFs among states of diseases were observed.
Out of the sufficiently powered meta-analyses, the SCZ data have the same consistency (9/15, 60%) as BD (6/10, 60%) data, and are better than MDD (1/6, 17%) data when IRFs were meta-analysed multiple times. The pre-and post-treatment comparison is one of the major sources of inconsistency (3/6 in SCZ and 5/5 in MDD). For example, IL-6 has inconsistent findings in SCZ preand post-treatment comparison 41,42 , but was reported to have consistent changes in several different states of SCZ: higher in acutely ill, chronically ill and firstepisode SCZ patients than in controls. CRP is the only IRF that was reported to change in all three BD (manic, depressive and euthymic) states. But the depression state has conflicting results 43,44 . Studies comparing patients with first-episode SCZ patients with controls contain the most consistent results as reported in the two meta-analyses by Miller and collaborators 13,45 , many were also replicated by Upthegrove et al. 46 except for IL-4 and IFN-gamma.
The state-related data carry more uncertainty than traitrelated data. We are particularly concerned by the frequent inconsistencies observed in the pre-and posttreatment comparison studies. Among these state-related studies, findings from first-episode SCZ studies are likely more reliable. All the results are shown in Supplementary  Table S3 for BD, S4 for MDD and S5 for SCZ.
It should be noted that our study was built upon metaanalysis results, which represented the consensus of multiple individual studies, less influenced than original individual studies by biological or technical noises or variations from individual studies. At the same time, meta-analyses cannot recover signals that may be missed by original studies due to technical limitations. Metaanalyses can be misled by systematic bias in individual data. Therefore, systematic investigation with better technologies and better experimental design has the potential to uncover IRF changes that we missed, or to disapprove a finding claimed here. Table 2 Well-powered significant changes of inflammation-related factors in psychiatric disorders and estimates of sample size required to achieve 80% power    Table 2 continued "-" refers to single study without consistency check

Better study design to translate inflammationrelated factors to biomarkers
More research is necessary before IRFs can be used as clinically useful biomarkers. We noticed some major pieces missing in previous studies, mostly due to the limitations of technology and costs. Most of those studies focused on candidate genes, which do not provide systemwide information. Statistical procedures were commonly liberal, leaving more space for false findings. Since longitudinal studies are still rare, we can neither differentiate acute and chronic changes nor can we discern trait and state markers. Lastly, blood-brain comparisons may not be necessary for developing biomarkers, but such comparisons are important for establishing the mechanisms of psychiatric disorders.

Unbiased system-wide study
Candidate gene studies contributed to answering specific questions in biology, but they frequently missed the big picture, without assessing the impact of the genes that changed the most. Candidate genes are generally not the top signals in genome-wide or system-wide studies. The bigger differences between cases and controls with better disease specificity are hidden in the systems and remain to be discovered. Therefore, the current candidate genes, including all the IRFs reviewed here, may not be the best genes for biomarker development. The other genes discovered from system-wide hunting may be better biomarker candidates.
The regulatory system that modulates inflammation remains poorly understood. Regulation of inflammatory responses occurs at various levels including genetic, transcriptional, translational, protein modification and protein interaction, and involves many signalling pathways 47 . One GWAS of inflammation quantitative trait locus (QTL) identified some potential genetic regulators of inflammation 48 . The omics-based approaches can help reveal more about these regulatory processes. Genomic and epigenetic methods have been applied to adipose tissue 49 and blood leucocytes 50 , leading to the discovery of large inflammatory response networks. These studies have only scratched the surface of this complex regulatory system. More remains to be learned regarding the regulation of inflammation in response to both acute and chronic stressors.
Hundreds of genes or proteins are involved in immunity and inflammation regulation and should be studied in relation to psychiatric disorders. Only a limited number of them have been investigated in psychiatric disorders due to the constraints of methodology and resources. Previous studies were completed by using the enzyme-linked immunosorbant assay (ELISA) or multiplex protein assays. Multiplex protein assay is a technology getting more attention recently 51 . It has been used in studies of alcoholism 52 , depression 53 and autism 54 , with up to 41 IRFs examined simultaneously 52,53 . The rapidly evolving mass spectrometry-based proteomics will soon provide more comprehensive coverage. The advanced technology will make system-wide studies of inflammation feasible in the near future.

Mandatory power analysis
Our analyses indicate that well-powered studies yield more reproducible results. Few studies reported statistical power. We performed power analysis 39 on the 427 metaanalyses of 44 IRFs done in the eight disorders, 318 of them (74%) have sufficient power ( ≥ 0.8), while the rest were under-powered. Sample size, one of the major drivers of statistical power, is a limiting factor for many studies. The good news from this study is that the sample size required to detect the differences in IRFs between patients and controls might be relatively small for many IRFs. Based on the median effect size summarised in  study at the same time in one disorder, the significance level can be set to p < 0.0016. At that p value, 13 IRFs would only need <100 subjects (50 cases and 50 controls) to achieve 80% power to detect a significant difference (Table 2). Certainly, since many of the IRFs have limited studies, their effect sizes in this table may be inflated. Independent testing will be needed.
Practicing strict experimental, analytical procedures and statistical criteria Strict quality control procedures and sufficient attention to covariance are critical towards reducing false-positive rates. From our review, we noticed that few existing studies paid sufficient attention to data quality control or statistical procedures controlling for covariates. Most IRF studies neither evaluate technical variances for their assays, nor did they report data-missing rates. These experimental factors can impact data quality. Standard requirements should be developed as common practice for IRF studies as they are for most genomic studies.
Covariates of many types should be considered. ELISA and the multiplex assay results can be easily confounded by technical artefacts, including batch effects (differences in the same sample measured at different times) and site effects (data collected from different machines may not be measured consistently). Ignoring the covariates for inflammation such as blood counts, BMI, sex and chronic stress levels can lead to false findings. The inappropriate management of covariates can misdirect meta-analyses. As shown in Supplementary Table S1, 93% of the previous studies did not address controlling batch effects for large sample sizes. Nearly 40% of the studies did not address covariates in their analyses. Collecting data from as many variables as possible and controlling them in analyses should be standard practice in future research.
Strict statistical criteria for defining significance should be consistently applied across studies. Many publications considered p ≤ 0.05 to be significant even when multiple IRFs were tested. With that, publication bias for the positive results likely occurred 55 . In Supplementary Table S1, about 75% of the studies did not correct for multiple testing inflation. Genetic studies of candidate genes have produced a large number of false positives in small-sample studies due to loose statistical criteria. Some scientists propose changing the threshold of significance to a smaller and more debatable value like 0.01. Inflammation studies should consider implementing a stricter threshold to reduce the number of false positives. Small-sample studies are valuable to scientific literature as they benefit meta-analysis and large consortium efforts, particularly since collecting clinical data is very challenging and very expensive. But standardised common practices to ensure high-quality data are critical for these small studies too.

Longitudinal study for acute and chronic changes, trait and state biomarkers
Acute and chronic inflammation can have very different impacts and consequences in aetiology and pathology. Many physical and psychological stressors, environmental insults including acute smoking 56 , acute ethanol intoxication 57 , long-term smoking 58,59 , binge drinking 60 , psychological stress [61][62][63] , diet 64 , air pollution 65 and infection all can induce inflammation. The inflammation levels likely also fluctuate temporally due to environmental and circadian rhythm changes regardless of health status. Single-time-point measures could be misleading. Therefore, longitudinal studies with multiple time-point measures of inflammation in peripheral tissues are necessary to accurately assess the inflammatory changes. The economical and convenient methods of data collection need to be developed for longitudinal studies.
Biomarkers will be more informative if they can differentiate traits and states to facilitate diagnosis and differential diagnosis and monitor prognosis. In this study, we only had data to compare eight psychiatric disorders. Changes of IRFs are not limited to neuropsychiatric disorders; they are also reported in patients with cardiovascular and metabolic diseases 66 , as well as cancer 67,68 . Therefore, much remains to be learned about the specificity of each individual IRF as related to various psychiatric and non-psychiatric diseases.
State markers represent the status of clinical manifestations in patients 69 . The clinical symptoms can change through the disease course while the diagnosis remains unchanged. To differentiate trait and state markers, we will need to examine patients of different episodes, illness or symptom stages, instead of lumping all patients into one case group.
A few IRFs were presented as potential state and trait markers. Our consistency analysis based on sufficiently powered meta-analyses suggests that trait-related findings might be more reliable than state-related findings so far. The state-related analyses have more inconsistent results than the trait-related analyses. More work remains to be done to accurately identify the state-related changes.

Peripheral vs. central nervous system
Biomarkers of psychiatric disorders need to have biological relevance based on solid mechanistic understanding that will ultimately link blood and brain function. In our data, the difference of IL-12 levels in CSF between SCZ and controls was not significant 13,46 ; however, IL-12 significantly increased in the blood of patients with SCZ 13,45 . We can reasonably hypothesise that inflammation could be tissuespecific. Tissue specificity may differentiate some disorders. The central nervous system may have a different inflammatory response than the peripheral tissue depending on the source of the insult. This may be particularly true in relation to the blood-brain barrier, which separates the central nervous system from peripheral circulation and responds to inflammation under its own regulation 70,71 . A systematic evaluation of the similarity and dissimilarity of inflammation across tissues in different disorders is needed. Peripheral blood data are useful for developing clinically informative biomarkers, but the brain is the ultimate affected tissue of psychiatric disorders. Therefore, building the blood-brain relationship will be critical in proving the biological relevance and illustrating the biological mechanisms. Postmortem brain data are valuable, and the PsychENCODE project 72 brought more insights into the inflammation status of brains affected by SCZ, BD and ASD. However, resolving the causal relationships in the study of inflammation using postmortem brains will be challenging. Animal and cellular models will hold great value in understanding the regulation of inflammation [73][74][75] .
In summary, this systematic evaluation of inflammatory changes across multiple psychiatric disorders showed us the intriguing possibility of differentiating psychiatric disorders by using inflammatory biomarkers. Moreover, inflammation may have a large enough effect size that these biomarkers could be detected in relatively small sample sizes. We propose that a system-wide longitudinal study using strict analytical procedures could render effective and useful biomarkers.