‘Did I request thee, Maker, from my clay

To mould me man? Did I solicit thee

From darkness to promote me?’

—Milton, Paradise Lost

Gene expression and stem cell-derived cells

Gene expression data directly reflect genetic inheritance and acquired genetic mutations, as well as environmental influences. Scientifically, they may also help tie together and unravel epistasis (co-acting gene expression,1 ‘genes that change together work together’), as well as regulatory networks of non-coding single-nucleotide polymorphisms (SNPs),2 epigenetic changes, chromatin modifications, non-coding RNAs and transcription factors responsive to environmental stimuli. Human gene expression studies have been carried out in postmortem brain,3, 4 as well as in peripheral blood,5, 6 fibroblasts,7 olfactory epithelium-derived neurons8 and, more recently, in induced pluripotent stem cell (iPSC)-derived neurons.9, 10 Each approach has strengths and limitations (Table 1). The quest for peripheral tissue read-outs and biomarkers is particularly important in psychiatry, as the target organ (brain) is not accessible to biopsies in live humans, for obvious practical and ethical reasons. The integration of genomics with phenomics (for example, quantitative clinical data), in particular the issue of whether a marker reflects state, trait, both, or neither, is important and often overlooked. The ability to correlate peripheral read-outs directly with mental states (for example, symptom severity), or indirectly with mental traits (for example, psychiatric diagnosis), determines what kind of biomarkers can be discovered using different tissues and approaches.

Table 1 Human neuropsychiatric gene expression studies

iPSCs, in addition to future hypothetical organ-building regenerative medicine applications, may be more immediately useful for understanding disease,11 and particularly for drug testing and drug discovery,12 including personalized medicine approaches. However, concerns arise about genetic and gene expression artifacts induced by the in vitro stem cell creation process. Two of the four transcription factors used to create iPSCs (c-Myc and KLF4) are oncogenic. Interestingly, histone deacetylase inhibitors (HDACi), such as the neuropsychiatric agent valproate,13 may provide a safer alternative for helping transform adult cells into iPSCs. Valproate might also expand the pool of neural stem cells in the adult brain.14 The effect of the HDACi per se on the gene expression landscape would have to be factored out in scientific studies. Olfactory epithelium-derived neuronal precursor cells may also have fewer transformation artifacts,8, 15 although the cell culture and passaging artifacts remain in common with the other cell culture approaches. Finally, all neuronal-like cells derived with these methodologies need to be validated as being indeed reflective of true neurons. Some of the methods used for this are, in increasing order of relevance, neuronal biochemical marker testing (immunohistochemistry), testing for synapse formation (electrophysiology) and functional integration in vivo.16

The gene expression data obtained from such cells arguably need additional cross-validation for relevance to in vivo functioning and disease states.

Convergent Functional Genomics (CFG)

Genetic and gene expression studies of medical disorders in humans and in lower organism models (mice, rats, dogs, zebrafish, Drosophila, Caenorhabditis elegans, yeast) are becoming increasingly integrated. Particularly for genomics, the convergence and integration of data across experimental modalities, technical platforms, and species are providing a fit-to-disease way of extracting reproducible and biologically important signal, in contrast to the fit-to-cohort effect and limited reproducibility of human genetic analyses alone. Because emerging data from the ENCODE project suggest that a major portion of the non-coding genome may contain regulatory variants, convergent approaches are going to be important to identify disease-relevant signal from the polymorphic variation present in the population.

CFG1, 5, 17, 18, 19, 20, 21, 22, 23, 24 is a powerful methodology developed over the past 15 years for extracting signal from noise by gene-level integration of multiple independent lines of evidence (genetic, gene expression, proteomics) from human and lower organism model studies of brain, peripheral tissues and cell lines (Figure 1). Lower organism model data can provide sensitivity and the ability to conduct experimental manipulations not feasible in humans. Human data provide more specificity and relevance to the human disease. Combined, we have an approach that increases our ability to distinguish signal from noise even with limited size cohorts and data sets. CFG helps to identify and prioritize candidate genes for the illness, using a polyevidence score. All these lines of evidence are the result of independent experiments. The virtue of this networked approach is that, even if one or another of the ‘nodes’ (lines of evidence) becomes questionable or non-functional upon further evidence in the field, the network is resilient and maintains its functionality. The prioritization of candidates is similar conceptually to the Google PageRank algorithm: the more links (lines of evidence) to a candidate, the higher it will be prioritized. Subsequent biological pathway analyses on these prioritized genes can uncover mechanistic aspects of the disease being studied. More recently, variations and expansions of this approach have been used successfully by other groups as well.25, 26
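The link-counting idea behind the polyevidence score can be illustrated with a minimal sketch. This is not the published CFG scoring scheme: the gene names are placeholders, the membership of each gene in each line of evidence is invented for illustration, and the equal weights are an assumption. The sketch only shows that a gene supported by more independent lines of evidence ranks higher, and that dropping one line of evidence need not change the top of the list.

```python
from collections import defaultdict

# Hypothetical evidence table: line of evidence -> genes it implicates.
# Gene names and memberships are placeholders, not real findings.
EVIDENCE = {
    "human_genetic": {"GENE_A", "GENE_B", "GENE_C"},
    "human_postmortem_brain": {"GENE_A", "GENE_B"},
    "human_blood": {"GENE_A", "GENE_D"},
    "animal_model_brain": {"GENE_A", "GENE_B", "GENE_C"},
    "ipsc_neurons": {"GENE_A", "GENE_D"},
}

# Assumption: every line of evidence counts equally (weight 1.0).
WEIGHTS = {line: 1.0 for line in EVIDENCE}


def polyevidence_scores(evidence, weights):
    """Sum the weights of all lines of evidence that implicate each gene."""
    scores = defaultdict(float)
    for line, genes in evidence.items():
        for gene in genes:
            scores[gene] += weights[line]
    return dict(scores)


def prioritize(scores):
    """Rank genes by descending score, breaking ties alphabetically."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def drop_line(evidence, line):
    """Return a copy of the evidence table without one line of evidence."""
    return {name: genes for name, genes in evidence.items() if name != line}


ranked = prioritize(polyevidence_scores(EVIDENCE, WEIGHTS))

# Resilience check: remove one 'node' and re-rank.
ranked_without_ipsc = prioritize(
    polyevidence_scores(drop_line(EVIDENCE, "ipsc_neurons"), WEIGHTS)
)

for gene, score in ranked:
    print(gene, score)
```

In this toy table the best-supported gene stays at the top of the ranking when the iPSC line of evidence is removed, which mirrors the resilience argument made above. A real analysis would replace the equal weights with whatever scoring rules the study defines.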

Figure 1

Convergent Functional Genomics: multiple independent lines of evidence for integration and prioritization of induced pluripotent stem (iPS)-derived neuronal cells data. CNV, copy number variant; QTL, quantitative trait loci.

Our past work provides evidence for the advantages, reproducibility and consistency of gene-level analyses of data, as opposed to SNP-level analyses, pointing to the fundamental issue of genetic heterogeneity at the SNP level.27 In fact, it may be that the more biologically important a gene is for higher mental functions, the more heterogeneity it has at the SNP level and the more evolutionary divergence, for adaptive reasons.28 A similar diversity, for similar adaptive reasons, exists in immune system genes.

On top of the gene-level integration, CFG provides a way to prioritize genes based on disease relevance, not study-specific effects (that is, fit-to-disease as opposed to fit-to-cohort). Reproducibility of findings across different studies, experimental paradigms and technical platforms is deemed more important (and scored as such by CFG) than the strength of a finding in an individual study (for example, the P-value in a genome-wide association study (GWAS)). This Bayesian-like approach minimizes false positives if one focuses on the top of the distribution, and minimizes false negatives if one goes deeper down the list (Figure 2). Most importantly, the CFG-prioritized genes show reproducibility and predictive ability in independent cohorts, which is the key litmus test for genetic and biomarker studies. Once the genes are identified and prioritized, biological pathway analyses can be conducted and mechanistic models can be constructed.

Figure 2

Top candidate genes for schizophrenia—Convergent Functional Genomics (CFG) analysis of ISC genome-wide association study (GWAS). iPS cell, induced pluripotent stem cell; ISC, International Schizophrenia Consortium.

Using a set of mouse experiments as a driving force,20, 23 or using human blood gene expression5, 6 or GWAS data22, 24, 27 as a driving force, such convergent studies from my group and others have identified and prioritized candidate genes and biomarkers for psychiatric disorders (bipolar disorder,22, 24 schizophrenia,21, 27 anxiety disorders,29 alcoholism19, 30) that show good reproducibility as well as predictive ability in independent cohorts. In essence, the CFG approach is a de facto field-wide collaboration, integrating the best available evidence at the time the analyses are conducted. Periodic re-analyses as future evidence accumulates in the field can improve and refine the results.

Application of CFG to stem cell-derived data

Data generated from neuronal-like cells derived from iPSCs can be cross-validated and prioritized using a CFG approach with other lines of evidence (Figure 1), or can serve as a line of evidence itself for the cross-validation and prioritization of, for example, GWAS data (Figure 2).

We have used the latter approach for schizophrenia.27 Data published by Gage and colleagues from schizophrenia subjects10 in iPSC-derived neuronal-like cells (‘hiPSC neurons’) were used as one of the multiple lines of evidence in a convergent approach that incorporated, besides GWAS data,31 human postmortem data, human blood gene expression data6 and animal model pharmacogenomics brain and blood gene expression data (using phencyclidine and clozapine as agonist-antagonist pharmacological agents21). In all, 21% (9 out of 42) of the top schizophrenia candidate genes identified by us in our overall CFG analysis had evidence in the hiPSC neurons study, and in 6 out of 9 of these genes the direction of change in expression in iPS-derived cells was the same as that in postmortem brains from schizophrenics (HSPA1B, TCF4, CD9, KALRN, PRKCA and NRG1) (Figure 2). Given that the ‘hiPSC neurons’ data in the original study were derived from only n=4 schizophrenic subjects,10 and that intra-subject as well as inter-subject variability in cell lines generated a large (596 unique genes) and potentially noisy list of differentially expressed genes, the use of cross-validating approaches such as CFG was essential to pinpoint the most disease-relevant genes.
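The overlap and direction-of-change comparison described above can be sketched as follows. The gene names and the 'up'/'down' directions in the toy data are placeholders, not values from the studies cited, and the function names are illustrative assumptions. Only the counts passed to `percent` at the end (9 of 42, and 6 of 9) come from the text.

```python
def overlap_and_concordance(top_candidates, ipsc_direction, brain_direction):
    """Find top candidates seen in the iPSC data, and those changed the same way in brain.

    ipsc_direction and brain_direction map gene -> 'up' or 'down'.
    """
    overlapping = [g for g in top_candidates if g in ipsc_direction]
    concordant = [
        g for g in overlapping if brain_direction.get(g) == ipsc_direction[g]
    ]
    return overlapping, concordant


def percent(part, whole):
    """Percentage of part in whole, rounded to the nearest integer."""
    return round(100 * part / whole)


# Placeholder data for illustration only.
top_candidates = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
ipsc_direction = {"GENE_A": "up", "GENE_C": "down", "GENE_E": "up", "GENE_Z": "up"}
brain_direction = {"GENE_A": "up", "GENE_C": "up", "GENE_E": "up"}

overlapping, concordant = overlap_and_concordance(
    top_candidates, ipsc_direction, brain_direction
)
print(overlapping, concordant)

# Counts reported in the text: 9 of 42 top candidates, 6 of those 9 co-directional.
print(percent(9, 42), percent(6, 9))
```

Genes that appear in the iPSC list but not among the prioritized candidates (here the placeholder `GENE_Z`) are simply ignored, which reflects how the convergent approach filters a long, potentially noisy list down to the cross-validated subset.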

The case of HSPA1B (heat-shock 70-kDa protein 1B), for example, a previously more obscure gene in terms of involvement in schizophrenia, is illustrative of the utility of a non-hypothesis-driven, convergent approach. HSPA1B, a chaperone involved in stress response, stabilizes existing proteins against aggregation and mediates the folding of newly translated proteins. HSPA1B has some previous genetic evidence for association with schizophrenia.32 It is co-directionally increased in expression in postmortem brains33 and iPSC-derived neurons from schizophrenia patients. HSPA1B is also decreased in expression by antipsychotic treatment with clozapine in the brain and blood of a mouse model, based on our previous work.21 It was also co-directionally increased in the brain and blood in a pharmacogenomic mouse model of anxiety disorders that we have recently described,29 as well as in a stress-reactive genetic mouse model.20 Treatment with the omega-3 fatty acid docosahexaenoic acid reversed the increase in expression of HSPA1B in this stress-reactive genetic mouse model.30 Another closely related molecule, HSPA1A (heat-shock 70-kDa protein 1A), is also present on our list of prioritized candidate genes for schizophrenia, with a lower CFG score of 3.5.27 Heat-shock proteins may be involved in the biological and clinical overlap and interdependence between response to stress,34 anxiety and psychosis.

A CFG approach could also be used in cases where HDACi are used for transformation, to understand which genes expressed in iPSC-derived cells are drug modulated. We have generated in our lab valproate brain and blood gene expression data sets5, 23 from mouse models, which could serve such a role.


Convergent approaches may be important for mining and interpreting gene expression data from pluripotent stem cell-derived cells in psychiatric and non-psychiatric disorders.