Transcriptomic immaturity inducible by neural hyperexcitation is shared by multiple neuropsychiatric disorders

Biomarkers are needed to improve the diagnosis of neuropsychiatric disorders, which are often associated to excitatory/inhibitory imbalances in neural transmission and abnormal maturation. Here, we characterized different disease conditions by mapping changes in the expression patterns of maturation-related genes whose expression was altered by experimental neural hyperexcitation in published studies. This analysis revealed two gene expression patterns: decreases in maturity markers and increases in immaturity markers. These two groups of genes were characterized by the over-representation of genes related to synaptic function and chromosomal modification, respectively. Using these two groups in a transdiagnostic analysis of 87 disease datasets for eight neuropsychiatric disorders and 12 datasets from corresponding animal models, we found that transcriptomic pseudoimmaturity inducible by neural hyperexcitation is shared by multiple neuropsychiatric disorders, such as schizophrenia, Alzheimer disorders, and amyotrophic lateral sclerosis. Our results indicate that this endophenotype serves as a basis for the transdiagnostic characterization of these disorders.


Supplementary Notes
To evaluate the time-course of changes in immature-like gene expression patterns after seizure induction, we compared datasets from the DG of typically developing infants with datasets from rat DG at three different timepoints after seizure induction by injection of pilocarpine or kainite (day 1, day 3, and day 10). As shown in Supplementary Figure 2a, the overlap P-values for the hiM genes were smaller than those for the hiI genes on the first day after seizure induction in both the pilocarpine and kainate datasets. Interestingly, the overlap P-values for the hiI genes were smaller than those for the hiM genes on day 3. These differences in the overlap P-values of the hiM/hiI genes with the number of days after seizure induction suggest that the expression levels of maturity-/immaturity-related genes induced by neural hyperexcitation change over time; the expression changes of hiM genes in the DG occur earlier than those of hiI genes after seizure induction. The similarity of the results for pilocarpine and kainite also suggests that the induction of immature-like gene expression is not dependent on the specific seizure-inducing drugs but on the induced neural hyperexcitation itself.
We performed principal component analysis (PCA) to visualize the relationships between hiM/hiI genes at different timepoints after seizure induction. For these analyses, we generated integrated datasets comprised of hiM or hiI genes at three different timepoints after seizure induction by pilocarpine or kainite (pilocarpine hiM gene sets, pilocarpine hiI gene sets, kainite hiM gene sets, and kainite hiI gene sets) and performed PCA for each dataset. Datasets at every timepoint after seizure induction and those of untreated control were plotted in a two-dimensional space with coordinates corresponding to principal components (PC) 1 and 2 (Supplementary Figure 2b). The PC1/PC2 coordinates of the hiM genes are different for day 1, day 3, and the untreated control, but those for day 3 and day 10 are close to each other (Supplementary Figure  2b), indicating that the expression changes in hiM genes after seizure induction become stable by day 3. However, the PC coordinates for hiI genes differed from each other at all timepoints after seizure (Supplementary Figure 2b), indicating that the expression changes in hiI genes continue after day 3 until at least day 10. These results also suggest that the changes in expression of the hiM genes follow a different temporal pattern than the changes in expression of hiI genes.
We conducted a bioinformatic estimation of the subcellular distributions of the protein products of hiM/hiI genes. To predict their subcellular distribution, we used the open-source tool COMPARTMENTS, which provides information on the estimated subcellular localization of genes of interest (see above). The subcellular distributions of hiM genes in the plasma membrane/nucleus were 28%/16% (in the pilocarpine dataset) and 26%/15% (kainite), and those of hiI genes were 16%/20% (pilocarpine) and 16%/23% (kainate) (Supplementary Figure 1c). The proportion of expression in the plasma membrane was higher in the hiM gene group than in the hiI gene group, while that in the nucleus was higher in the hiI gene group than in the hiM gene group. These results indicate that the protein products of hiM/hiI genes tend to have spatially distinct localizations.
We applied our analysis on the microarray expression datasets generated by Gandal and colleagues, in which confounding factors are largely controlled (Supplementary Figure  3). The hiM-index and hiI-index of datasets from patients with schizophrenia were 7.85 and 2.46, respectively, indicating hiM-index-dominant pattern. This result indicates that schizophrenia datasets used in Gandal's study also show transcriptomic pseudoimmaturity inducible by neural hyperexcitation. The hiM-index and hiI-index of datasets from bipolar disorder patients were 2.19 and 1.48, respectively, and those of major depressive disorder were both 0. These indexes are smaller than those of schizophrenia, which is mostly consistent with our results in Figure 3. The hiM-index and hiI-index of autism were -6.74 and 28.66, respectively. This hiI-index-dominant pattern is different from our results, as shown in Figure 3d and 3i. Although the pattern in autism is apparently different from ours, datasets from autism show significant overlap with pseudoimmaturity inducible by hyperexcitation, regarding hiI-index. We have added analysis of the datasets from alcohol abuse disorder and inflammatory bowel disease. Their hiM-/hiI-indexes were 25.85/-5.92 and -7.03/9.31, respectively.
To further evaluate biological significances of hiM and hiI genes, we performed weighted gene coexpression network analysis (WGCNA), which is a common method in systems biology for describing the correlation patterns among genes across microarray samples based on the k-means algorithms. We applied this method to the datasets from mouse DG development (postnatal day 8, 11, 14, 17, 21, 25, and 29 mouse infants and 33-week-old adult mice), and extracted 5 modules (turquoise, blue, brown, yellow, and grey) (Supplementary Figure 4a). We compared hiM and hiI gene groups with these 5 modules and evaluated their similarities by Running Fisher test (Supplementary Figure 4b). The hiM and hiI genes showed partially different patterns of overlap with these 5 modules of coexpressed genes during DG development. While hiM genes significantly overlap with turquoise, blue, and yellow modules, hiI genes significantly overlap with only turquoise module (Supplementary Figure 4b).
Additionally, we performed pathway analyses on these 5 modules in BaseSpace (Supplementary Figure 4c). The turquoise module includes biogroups associated with the nucleus (e.g., "cell division", "Genes involved in Cell Cycle", and "mitosis"), and blue module includes biogroups associated with synapse (e.g., "dendritic spine" and "synapse"). Similar biogroups were found in results of pathway analysis in Table1 (for example, biogroups such as "synaptic transmission" and "synapse" were involved in the results of pathway analysis for hiM genes; biogroups such as "Genes involved in Cell Cycle", "mitosis", and "cell division" were involved in the results of pathway analysis for hiI genes).
All information on the datasets and gene lists used in this study are provide as Supplementary Data 1 to 12.

Computing overlap P-values of gene expression patterns in different datasets
BaseSpace can be used to compare the signatures in publicly available microarray datasets with a signature provided by the user using a "Running Fisher" algorithm, as previously described [1][2][3][4][5] . To enable comparison across different arrays, orthologs were identified for each pair of organisms. Ortholog identification was based on information obtained from Mouse Genome Informatics (MGI) at Jackson Lab (http://www.informatics.jax.org), HomoloGene at NCBI (http://www.ncbi.nlm.nih.gov), and Ensembl (http://www.ensembl.org). The overlap P-value, i.e., the direction of the correlation between two given gene signature sets (b1, b2), and the P-values between subsets of gene signatures are calculated as follows.
Each gene signature set was rank-ordered according to the absolute fold-change value. Upregulated and downregulated genes were denoted by positive and negative signs, respectively, to indicate directionality. A directional subset was generated for each direction, such as b1+, b1-, b2+, and b2-.
Next, all of the subset pairs were identified as b1Di, b2Dj, where Di and Dj were the available directions (+ or −) in b1 and b2, respectively. The Running Fisher algorithm was applied to each subset pair. The top ranking genes in the first subset b1Di were collected as a group, G, and the second subset b2Dj was scanned from top to bottom in rank order to identify each rank with a gene matching a member in group G. At each matching rank, K, the scanned portion of the second subset b2Dj, consisted of N genes, and the overlap between group G and these N genes was defined as M. Fisher's exact test was performed at rank K to evaluate the statistical significance of observing M overlaps between a set of size G and a set of size N, where the set of size G comes from platform P1, and the set of size N comes from platform P2, given the sizes of P1 and P2 as well as the overlap between P1 and P2. At the end of the scan, the best P-value was retained, and a multiple-hypothesis-testing correction factor was applied. The negative log of the multiple-testing-corrected best P-value (Pb1Di→b2Dj) was a score (Sb1Di→b2Dj) for the subset pair. Here, the subscript b1Di → b2Dj indicates that b1Di was the first subset used to define the top genes G, and b2Dj was the second subset that was used for the scan. Sb1Di→b2Dj = -ln Pb1Di→b2Dj (1) Next, the Running Fisher algorithm was performed in the reverse direction. The same procedure in this reverse direction produced another score ( ) for the same subset pair. The two scores were averaged to represent the magnitude of the similarity between the two subsets. (2) The P value (Pb1Dib2Dj) between b1Di and b2Dj was calculated using the following equation: Pb1Dib2Dj = exp ( -Sb1Dib2Dj ) (3) A positive sign was assigned to pairwise correlation scores (Sb1+b2+ and Sb1-b2-) for a subset pair of the same direction (b1+b2+, b1-b2-), and a negative sign was assigned to pairwise correlation scores (Sb1+b2-and Sb1-b2+) for a subset pair of opposite directions (b1+b2-, b1-b2+). Then, the overall score (Sb1b2) between b1 and b2 was calculated from the correlation scores (Sb1+b2+, Sb1-b2-, Sb1+b2-, and Sb1-b2+) of subset pairs using the following equation: The sign of Sb1b2 reflected whether the two signatures were positively or negatively correlated. The overall P-value (Pb1b2) between b1 and b2 was calculated using the following equation: Pb1b2 = exp ( -|Sb1b2| ) (5) This overall P-value is referred to as the "overlap P-value" between two gene expression patterns in this paper.

Prediction of the subcellular of proteins coded by genes
To predict the subcellular localizations of proteins coded by genes in each dataset, COMPARTMENTS (http://compartments.jensenlab.org) 6 was used. COMPARTMENTS is a web resource that integrates evidence on protein subcellular localization from manually curated literature, high-throughput screens, automatic text mining, and sequence-based prediction methods. For each gene queried, it provides a score reflecting the localization to multiple cellular compartments (e.g., the plasma membrane, nucleus, cytosol, and so on) based on aggregating data from prediction algorithms (e.g., PSORT and YLoc). The WoLF PSORT program 7 , a sequence-based protein localization predictor, was used for the prediction of protein subcellular localization in this study. PSORT predicts subcellular localization based on various sequence-derived features such as sorting signals, binding domains, and amino acid composition. All PSORT scores were sorted by the number of stars (Spsort) assigned to a sequence-based prediction, from 0 to 3 (e.g., Rreb1: nucleus = 3, cytosol = 2, and other = 0; Fam107a: nucleus = 2, cytosol = 2, and other = 0). Detailed information about PSORT and Spsort score have been described previously 6,7 . In this study, the score of each gene was determined by the highest Spsort score among the subcellular compartment(s). If there were two or more compartments sharing the highest score, the score of each gene was determined by dividing the total number by the number of compartments with the highest score (e.g., Rreb1: nucleus = 1; Fam107a: nucleus = 0.5, cytosol = 0.5). The subcellular distribution patterns of hiM/hiI genes are expressed as the proportion of the integrated scores of the top 50 genes.

Principal component analysis
Principal component analysis (PCA) was performed to reveal the relationship between datasets of marker genes. PCA was performed for datasets composed of hiM/hiI genes at three different timepoints: day 1, day 3, and day 10 after treatment. From the comprehensive gene lists, the 500 genes with the largest fold changes were used for PCA. The two primary components in the results of PCA, which are denoted as principal component 1 (PC1) and PC2, correspond to the x-/y-axes on the graphs. For data processing and PCA, we used R for Mac OS X (The R Foundation for Statistical Computing).