Abstract
Genome-wide association studies and other discovery genetics methods provide a means to identify previously unknown biological mechanisms underlying behavioral disorders that may point to new therapeutic avenues, augment diagnostic tools, and yield a deeper understanding of the biology of psychiatric conditions. Recent advances in psychiatric genetics have been made possible through large-scale collaborative efforts. These studies have begun to unearth many novel genetic variants associated with psychiatric disorders and behavioral traits in human populations. Significant challenges remain in characterizing the resulting disease-associated genetic variants and prioritizing functional follow-up to make them useful for mechanistic understanding and development of therapeutics. Model organism research has generated extensive genomic data that can provide insight into the neurobiological mechanisms of variant action, but a cohesive effort must be made to establish which aspects of the biological modulation of behavioral traits are evolutionarily conserved across species. Scalable computing, new data integration strategies, and advanced analysis methods outlined in this review provide a framework to efficiently harness model organism data in support of clinically relevant psychiatric phenotypes.
Promises and challenges in human genetics of psychiatric disorders
Psychiatric disorders are highly polygenic and show a continuous range of variation influenced by both environmental and genetic factors [1]. A major goal of psychiatric genetic research is to better understand the molecular mechanisms through which genetic variants act to influence liability to these traits. The identification of novel genetic variants provides a foothold into the complex genetic architecture that undergirds psychiatric traits. Model organisms provide an avenue into understanding the biological mechanisms that are impacted by genetic variation. In this review, we outline big data approaches that efficiently weave the vast amounts of convergent genomic data from other species into human genetic findings to elevate the likelihood of uncovering biologically meaningful pathways for further experimental follow-up and therapeutic discovery.
The utility of genome-wide association studies (GWAS) in psychiatry
GWAS of psychiatric traits have generated an outpouring of recent discoveries in risk variant identification and polygenic prediction. From highly heritable traits, such as schizophrenia (for which >100 common loci have been reported with N = 150,064 [2]) to common but less heritable conditions such as problematic alcohol use (for which 29 independent loci have been reported with N = 435,563 [3]) and major depression (for which 102 common loci were detected with N = 807,553 [4]), as well as for liability across psychiatric disorders (109 loci with N = 727,126 [5]) progress abounds. In addition, for substance use, a recent large GWAS of tobacco smoking (N for smoking initiation = 1,232,091) and typical drinking (N for drinks/week = 941,280) has identified over 400 loci [6]. The increased power accumulated across studies of major psychiatric disorders, arising from collaborative research, has revealed clues into novel mechanisms of susceptibility to mental illnesses and substance use disorders. These large-scale GWAS have also revealed patterns of genetic variation associated with multiple disorders as well as disorder-specific loci, e.g., CADM2 has been linked to multiple substances and common addiction mechanisms (e.g., risk-taking cognition), while the alcohol dehydrogenase genes remain alcohol-specific (e.g., [7, 8]).
Challenges and opportunities within GWAS for psychiatric genetic studies
The recent gains in psychiatric genetic studies outlined above amplify the need to address several enduring challenges within GWAS. First, at a variant level, the bulk of GWAS “hits” fall in noncoding regions of the genome. A major advantage of GWAS as a means of discovering the biological basis of psychiatric disorders is that the lack of a priori, gene-centric hypotheses enables discovery of trait regulatory variants in enhancer and promoter regions, lncRNAs, microRNAs, and any other molecular entity that is part of the gene-regulatory mechanism. However, for such noncoding variants, in contrast to variants within coding genes, it is far more difficult to link statistically significant genetic associations to the gene products and biological mechanisms through which they act [9]. Interpretations of significant GWAS findings are complicated by patterns of correlated inheritance (e.g., linkage disequilibrium), such that the most strongly associated genetic variant in a locus may not be “causal” but could “tag” a true causal variant. This, coupled with long-distance genomic regulation, poses challenges for unveiling specific genes and variants underlying human traits via GWAS [10]. In this review, we highlight how regulatory genetic variants can be integrated coherently with coding genes within and across species using unifying data structures.
A second challenge with GWAS is that power analyses reveal that the massive polygenicity underlying psychiatrically relevant traits and illnesses requires larger sample sizes for additional discoveries from GWAS data alone [11]. Likewise, the predictive power of a polygenic risk score (PRS), an index of aggregated genetic susceptibility to a disorder, for psychiatric disorders is also directly linked to the current statistical power of discovery GWAS [12]. However, the identification of additional trait-associated variants continues to substantially augment SNP-heritability estimates, especially in the case of rare variants, suggesting that there is more signal to be found in GWAS and sequencing studies [13], provided that higher sample sizes continue to be attained. In this review, we highlight approaches that exploit complementary data resources from model organisms that, when placed in an integrative framework with GWAS data, are showing some promise in prioritizing variants that are detected.
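In its simplest form, the PRS described above is a weighted sum of effect-allele counts, with weights taken from a discovery GWAS. The following is a minimal sketch under that assumption; the variant identifiers, effect sizes, and genotypes are hypothetical, and real pipelines additionally handle linkage disequilibrium, strand alignment, and thresholding or shrinkage of weights.

```python
# Hypothetical per-variant effect sizes (e.g., log odds ratios) from a discovery GWAS.
gwas_weights = {"rs_a": 0.08, "rs_b": -0.05, "rs_c": 0.12}

# Hypothetical genotypes: number of effect alleles (0, 1, or 2) carried at each variant.
genotypes = {
    "person_1": {"rs_a": 2, "rs_b": 0, "rs_c": 1},
    "person_2": {"rs_a": 0, "rs_b": 2, "rs_c": 0},
}


def polygenic_score(dosages, weights):
    """Sum effect-allele dosages weighted by GWAS effect size.

    Variants missing from the genotype record contribute zero.
    """
    return sum(weight * dosages.get(snp, 0) for snp, weight in weights.items())


scores = {pid: polygenic_score(d, gwas_weights) for pid, d in genotypes.items()}
```

Because every weight is estimated with error that shrinks as the discovery sample grows, the accuracy of such a score is tied to the power of the GWAS that supplied the weights.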
Third, consistent with indications from early family and twin studies, there is evidence for pleiotropy among psychiatric traits to a degree suggestive of an underlying dimension of genetic liability that parallels the general factor model of psychopathology [5, 14]. Thus, it is important to consider variants in the context of both the underlying neurobiological mechanisms in which they function and the multiple traits that are influenced by that variation, to find the specific, as well as the overlapping, biological mechanisms underlying behavioral traits.
A landmark contribution to our current ability to annotate GWAS signals arises from FUMA [15], a platform for functional and regulatory annotation of variants. Summary statistics from a GWAS can easily be aligned with tissue and cell-type-specific expression data and with a variety of regulatory and chromatin signatures with no computational burden on the user, making FUMA widely accessible. As an alternative to gene-based mapping techniques, software tools can also map variants to the noncoding transcriptome (e.g., LincSNP 3.0 [16]). Beyond variant mapping, multiple sources of omics data can be harnessed in a multivariate framework to implicate “causal” gene sets for a disease state (e.g., SMR [17], iRIGs [18], PAINTOR [19], FOCUS [20]). Efforts are also underway, with varying degrees of success, to demonstrate to what extent similar regulatory enrichment of PRSs could enhance prediction (e.g., AnnoPred [21], LDpred-funct [22]). However, most of these approaches have been limited to human genetics and genomics data. In this review, we highlight approaches that bring together the breadth and depth of well-controlled model organism studies that place genetic and genomic findings in biobehavioral context and that can expand on this or other interpretive tool sets.
Multi-species genomics to address challenges in GWAS variant interpretation
Across these historical and contemporary research challenges, Big Data approaches that harness information from additional sources, including cross-species genomic analyses, can provide elegant solutions to current barriers in psychiatric genetics [18, 23]. It cannot be overstated that we need better-powered GWAS, especially as we look to polygenic scores as a means of leveraging the modest effect sizes from GWAS. However, increasing the sample size alone may be merely a theoretical solution for certain traits where rare variation and modest effect sizes contribute substantially. Incorporating evidence from molecular and cellular biology shifts the focus of genome-wide analyses from variant detection and identification to evaluating the relative contribution of a prioritized subset of loci. This reduces the multiple-testing burden incurred in controlling the family-wise error rate, thus increasing power, and provides context about the genome at multiple levels (i.e., structure, function, and regulation) while also accounting for the polygenicity of a trait. Leveraging information from annotated genomic regions that affect gene function was shown to robustly increase the power to identify genomic associations across 27 human traits [24].
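The power gain from testing a prioritized subset can be illustrated with a Bonferroni correction, one common way of controlling the family-wise error rate. This is a sketch only: the figure of one million tests is the conventional assumption behind the familiar genome-wide threshold, and the 10,000 prioritized loci are a hypothetical number chosen for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# Conventional assumption: roughly one million independent common-variant tests.
genome_wide = bonferroni_threshold(0.05, 1_000_000)

# Hypothetical: only 10,000 loci prioritized by functional evidence are tested.
prioritized = bonferroni_threshold(0.05, 10_000)

# How much less stringent the per-test threshold becomes.
relaxation = prioritized / genome_wide
```

Under these assumptions the per-test threshold relaxes a hundredfold, so an association of a given effect size can be detected with a smaller sample, provided the prioritization was carried out independently of the association data being tested.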
There is extensive information available from human and model organism functional genomics that may be brought to bear on human GWAS findings in the context of specific behaviors, tissues, and molecular mechanisms [25, 26]. Prior to the widespread availability of human ‘omics data, some of the earliest efforts to characterize the mechanism of variants detected in human association studies relied on expression of orthologous genes from studies performed in animal models. The rich data resources from these studies continue to be valuable due to the breadth and depth of studies that are possible in animal models, under precisely controlled conditions of drug exposure and other neurobiological or behavioral processes. Further, model organism data also contain a rich source of expression regulatory information including expression quantitative trait locus (eQTL) and epigenetic data from many tissues and brain regions, some of which is collected in populations that facilitate the global correlation of transcript abundance to neurobiological and behavioral parameters [27]. Integration of functional genomic information from multiple species into GWAS provides new clues about the biological context and consequences of genetic associations and PRSs, and provides insight into how to model such variation in in vivo preclinical models with intact central nervous systems and expression regulatory machinery.
Below, we illustrate the promise of harnessing these model organism data, for which decades of comparative behavioral research has produced numerous experimental paradigms aimed at consilience, such as drug self-administration and response studies across multiple mouse and rat populations in genetics and genomics [28]. We propose methods for integrating valuable and ever-expanding complementary model organism and human genetics and genomics data (such as GTEx [29] and GeneNetwork.org [30], psychENCODE [31] and modENCODE [32]) and highlight new approaches for boosting power in human genetics through Bayesian inference in heritability and polygenic analyses, outline exciting developments aimed at bridging the “analytic currency” gap between human and model organism research, and present some technical and philosophical challenges. The overarching goal of this review is to focus on ways in which we might utilize the complementary strengths of human and animal genetics to advance their common research mission: gaining a better understanding of the biology of complex traits.
Potential and challenges for model organism data integration
There is considerable and growing interest in employing nonhuman animals to meet some of the challenges for human genetics outlined above. There is a tremendous depth and breadth of model organism genetics and genomics studies spanning many behavioral and neurobiological parameters. These include differential expression studies following various behavioral and drug exposure paradigms [33], large-scale screens of gene-targeted deletion mutants [34], and genetic studies in populations such as the BXD RI mouse lines [35] and inbred strain panels [36], which often combine gene expression and genetic analysis. Numerous QTL positional candidates have been identified from a large number of behavioral and neurobiological mapping studies [37]. Selective breeding in rats and mice has been able to separate alcohol preferences [38, 39] and chronic use/withdrawal [40]. These data provide a rich backdrop and context in which to interpret the more global phenotype or disease information that is the frequent subject of GWAS analysis.
Animal geneticists have a rich history of using model organisms to study behavioral traits that mirror aspects of human psychopathology. Many of the genes and variants identified in model organisms are now also being found in human GWAS (Table 1), indicating that convergence of these studies is feasible. To date, model organism evidence has largely been used as a form of post-GWAS validation to characterize significant SNP/gene effects (e.g., [41, 42]). There have been a few promising recent examples of model organism research that, when coupled with human GWAS findings, have revealed insights into the biological mechanisms underlying psychiatric disorders. Model organism data have also produced experimental insight into disease mechanism. For example, researchers used mouse models to study the effect of a particular protein, complement component 4 (C4), on synaptic mediation during development [43]. By using a mouse model in conjunction with convergent evidence from human genomic studies, researchers were able to study the effects of C4 gene deficiencies on synapse elimination during postnatal development in a way that is not possible in humans. Researchers are beginning to leverage model organism genomics directly in the context of human genetic studies. For instance, gene co-expression networks associated with mouse neurodegeneration phenotypes demonstrated enrichment for human GWAS associations with Alzheimer’s disease [44]. Integrative methods for jointly analyzing model organism data directly with human GWAS are under active development. One recent example identified novel brain mechanisms of alcohol use and dependence by coanalyzing human GWAS, human protein–protein interaction networks, and mouse gene co-expression data. In doing so, the researchers interrogated ethanol-responsiveness genes obtained from mouse gene expression data of the prefrontal cortex (PFC), ventral tegmental area (VTA), and nucleus accumbens (NAc) [45].
Despite this substantial progress, there remain conceptual and technical challenges for data integration across species. These occur at the levels of phenotypic comparison, genetic conservation, and computational scale. A major challenge at the phenomic level is that any effort to integrate evidence across model organisms and humans must acknowledge that human psychiatric diagnoses and classifications are often based upon clinical instruments and nosology that are not easily transferable to model organisms; efforts to “diagnose” animal models are therefore discouraged. However, it is apparent that aspects of a disorder can transfer across species and be easily captured with experimental data, and increasingly, GWAS of psychiatric disorders are providing corroborating support for variants that influence both disorders and their trait-like manifestations that may be recapitulated in model organisms [46]. For example, it was recently shown that ethanol-responsive genes in mouse prefrontal cortex, nucleus accumbens, and ventral tegmental area were overrepresented in GWAS for alcohol dependence in the Irish Affected Sib-Pair Study of Alcohol Dependence and the Avon Longitudinal Study of Parents and Children [26]. The identification of network-level associations between humans and mice suggests shared sensitivity in ethanol responding, and thus can serve as support for nominal GWAS signals. However, far more complexity and heterogeneity than ethanol response underlies alcohol dependence in humans. Recent genomic distinctions identified between the consumption (AUDIT-C items 1–3) and the problematic (AUDIT-P items 7–10) subscales of the Alcohol Use Disorders Identification Test (AUDIT) [8, 47] echo similar findings in model systems, the data from which will be critical for the interpretation of molecular mechanisms [48].
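Overrepresentation of a model organism gene set among GWAS-implicated genes, as in the ethanol-response example above, is often assessed with a hypergeometric (one-sided Fisher) test. The sketch below illustrates that generic test and is not the procedure used in the cited study; all counts are hypothetical, and it assumes mouse genes have already been mapped to human orthologs within a common background.

```python
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) when `draws` genes are sampled without replacement
    from `population` genes, of which `successes` belong to the reference set."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favorable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favorable / total


# Hypothetical counts: 20,000 background genes, 300 ethanol-responsive orthologs,
# 150 GWAS-implicated genes, 8 of which are ethanol responsive.
p_value = hypergeom_upper_tail(population=20_000, successes=300, draws=150, observed=8)
```

The choice of background matters: restricting `population` to genes that have orthologs and are expressed in the profiled tissue avoids inflating the apparent enrichment.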
There is concern that comparative, multispecies approaches will not be as readily feasible for certain psychiatric traits. Behavioral characteristics including speech, language, and certain executive and metacognitive functions are impossible to assess in model organisms. It should be noted that most studies that attempt comparative genomics across species are based on limited genetic diversity, often comparing a single idiosyncratic strain to a small sample of the population of humans, e.g., [49], and therefore cannot discern between individual differences within populations and differences between species. For some disorders, there is a substantial role of brain structures that are under developmental control of poorly conserved genomic regions, leading to significant cross-species differences in these structures [50]. This potentially could preclude detection of genetic variants that regulate disorders through effects on the development of these structures. Following this logic, some aspects of substance use disorders (SUDs) are served by neural structures that show more conservation and may be more likely to provide convergent mechanistic evidence for overt characteristics of drug intake, withdrawal, and compulsive responding even with choice and punishment, but perhaps not “desire to quit” or other metacognitive and psychosocial aspects of addiction.
However, all psychiatric disorders, including SUDs, are highly complex traits likely involving many risk loci. Some of these effects are manifest across species, even if the end result in humans includes behavioral output not readily observable in nonhuman model organisms. Therefore, one can model the effects of genetic risk variants on more proximal biological consequences; for example, one might study the influence of C4 variation [43] on endophenotypes captured in the Research Domain Criteria (RDoC), including synaptic excitability, or neuronal reactivity and the various startle phenotypes it is associated with, but not all of the species-specific cognitive and behavioral output that is central to the disease pathology. Historically, the field has been distracted by pharmacologically predictive characteristics that have little face validity with the disorders to which they are applied [51]. Below we describe how cross-species comparative genomics provides a tool that can be used to identify what aspects of the human disorder are reflected in model organism genomics, allowing data-driven discovery of the relations among traits across species [52].
At the genetic conservation level, cross-species genetic research has been hindered by the “analytic currency” problem. Human geneticists typically work at the variant level, whereas genomics data, particularly from expression studies, are often reported at the gene or transcript level. Prior efforts at model organism follow-up of human GWAS data were limited to human variants that could be positionally assigned to a gene, but as described below, this is no longer the case. As is evident from regulatory mapping analyses, the action of a variant does not readily correspond to the most proximal gene, or even a single gene. Further compounding the problem, noncoding regulatory variants are often found in poorly conserved regions of the genome, which renders cross-species gene orthology mapping challenging and variant mapping through sequence alone impossible in many cases. Therefore, approaches that exploit both gene orthology and convergence of variant regulatory relations are most promising toward relating trait regulatory variation across species.
In the case of intragenic variants, current methods use transcript and protein annotations to identify causal SNPs based on the severity of mRNA and protein modifications [53] and other functional consequences [54]. However, the majority of SNPs are intergenic, suggesting the involvement of distal gene-regulatory mechanisms (e.g., chromatin accessibility). The common approach of associating SNPs to nearby downstream and upstream genes can therefore yield false positives [55], and it is necessary to use data from gene eQTLs, epigenetics, and 3D genomics to assess the relationships among regulatory variants and their distal targets. Although most prior variant-to-gene annotation efforts have relied on positional approaches, i.e., assigning SNPs to genes based solely on physical proximity (e.g., MAGMA software [56]), modern approaches in humans rely on extensively curated functional and regulatory mapping from ‘omics data (e.g., S-PrediXcan software [57], TWAS [58], Hi-C coupled MAGMA or H-MAGMA software [59]).
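The contrast between positional and functional assignment can be made concrete with a small sketch. It implements a generic window-based positional rule (not the algorithm of any particular tool named above) and compares it with a hypothetical eQTL table; gene names, coordinates, the 10 kb window, and the eQTL link are all assumptions for illustration, and a single chromosome is assumed.

```python
def positional_genes(snp_pos, genes, window=10_000):
    """Return genes whose span, extended by `window` bp on each side, contains the SNP."""
    return sorted(
        name
        for name, (start, end) in genes.items()
        if start - window <= snp_pos <= end + window
    )


# Hypothetical gene coordinates on one chromosome: name -> (start, end).
genes_chr = {"GENE_A": (100_000, 150_000), "GENE_B": (400_000, 420_000)}

# Hypothetical SNP position and eQTL evidence (SNP -> genes whose expression it alters).
snp_positions = {"rs_x": 155_000}
eqtl_targets = {"rs_x": {"GENE_B"}}

by_position = positional_genes(snp_positions["rs_x"], genes_chr)
by_eqtl = sorted(eqtl_targets["rs_x"])
discordant = set(by_position) != set(by_eqtl)
```

In this toy case the SNP lies just downstream of GENE_A, so the positional rule assigns it there, whereas the eQTL evidence points to the distal GENE_B; this is the kind of discordance that motivates functional mapping.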
However, all of these approaches have almost exclusively used data from human genomic analyses. Similar approaches have been deployed in model organisms, but the integration of resources across species has remained rather incomplete, limiting the approach to a small number of applications. To facilitate cross-species analysis, integrative data analyses have historically relied on gene homology associations from model organism databases [60] and gene orthology services [61]. Analysis involving multiple species therefore most often occurs at the gene level, introducing a GWAS-specific integration challenge: the need to associate genetic variants with genes. For complex disorders, such as schizophrenia and SUDs, this often requires characterization of the regulatory nature of genetic variants associated with disease, or identifying functional variants in submolecular domains of drug targets that could confer vulnerability or resistance to various treatments. However, noncoding regions of the genome are often very poorly conserved across species, and the targets of the variants can be far away. Moreover, many of the implicated noncoding variants in GWAS reside in gene expression regulatory regions [62]. Here, we highlight solutions for the assessment of conserved effects of variants through their orthologous genomic targets to support a wide range of applications in integrative functional genomics (Fig. 1).
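The idea of relating variants across species through their orthologous targets, rather than through sequence, can be sketched as a two-step lookup: variant to regulated human gene, then human gene to model organism ortholog. All identifiers and mappings below are hypothetical placeholders; in practice the first table would come from regulatory mapping and the second from an orthology resource.

```python
# Hypothetical regulatory mapping: human variant -> human genes it is thought to regulate.
variant_to_human_genes = {
    "rs_1": {"HUMAN_G1", "HUMAN_G2"},
    "rs_2": {"HUMAN_G3"},
}

# Hypothetical orthology table: human gene -> mouse ortholog (not every gene has one).
human_to_mouse = {"HUMAN_G1": "Mouse_g1", "HUMAN_G3": "Mouse_g3"}

# Hypothetical mouse gene set, e.g., genes differentially expressed after ethanol exposure.
mouse_gene_set = {"Mouse_g1", "Mouse_g9"}


def mouse_targets(variant):
    """Mouse orthologs of the human genes regulated by a variant."""
    human_genes = variant_to_human_genes.get(variant, set())
    return {human_to_mouse[g] for g in human_genes if g in human_to_mouse}


# Variants whose orthologous targets overlap the mouse gene set.
supported = {
    v: mouse_targets(v) & mouse_gene_set
    for v in variant_to_human_genes
    if mouse_targets(v) & mouse_gene_set
}
```

Note that the variant itself is never aligned across species; only its gene-level targets are, which sidesteps the poor conservation of noncoding sequence.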
Solutions for data-driven cross-species analysis
Broadly speaking, integration of multispecies functional genomic data can occur in two ways: from the phenomic or from the genomic orientation. For example, top-down, trait-based approaches to cross-species analysis utilize the similarity of human disease-related phenotypic profiles to model organism phenotypic profiles to identify gene-disease associations [63]. These approaches, embodied in resources developed by The Monarch Initiative [64], identify similar phenotypes across species through integrated ontologies and semantic similarity methodologies that apply semantic reasoners to a unified knowledge graph [65]. Such phenotype-driven approaches, which leverage multispecies data, have been effective at assisting rare disease diagnosis [66] and improving identification of causal genetic mechanisms [67], but these approaches are challenging to apply in the context of high phenotypic and genetic heterogeneity due to the extensive differences among species in the behavioral manifestations of neurobiological variation.
In highly complex psychiatric disorders in which model organism traits may only capture a facet of the human disease, alternative bottom-up strategies that aggregate genomic data may be more suitable for identification of the driving genetic mechanisms associated with complex traits and disease. The variety of biological entities that can be characterized via genome-wide experimentation (for example, genes, proteins, variants, methylation sites, and chromatin states) poses a challenge for integration and analytic efforts [68]. These challenges may be mitigated via combinatorial integration of fundamental data attributes into generalized data structures that can be mined for patterns or emergent gene-disease relationships. GeneWeaver [69], for example, relies on a bipartite data model [70] and heterogeneous data networks [71] to integrate and analyze functional genomics data such as differential expression studies, GWAS, curated annotations, and QTL mapping studies through a single data structure that facilitates aggregation of information. Harmonizome [72], on the other hand, aggregates functional genomics studies from a variety of sources by implementing an association matrix across shared attributes and relying on machine learning approaches to identify novel patterns.
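A bipartite data model of the kind mentioned above can be pictured as a graph with two node types, gene sets and genes, where an edge records membership. The sketch below is a generic illustration of that idea and makes no claim about how GeneWeaver is implemented; the set names and gene identifiers are hypothetical, and genes are assumed to be already mapped to shared ortholog identifiers.

```python
from collections import defaultdict

# Hypothetical gene sets from heterogeneous studies, keyed by study label.
gene_sets = {
    "human_gwas_alcohol": {"G1", "G2", "G3"},
    "mouse_expression_ethanol": {"G2", "G3", "G4"},
    "rat_qtl_preference": {"G3", "G5"},
}


def invert(sets_by_name):
    """Build the gene -> gene-set side of the bipartite graph."""
    gene_to_sets = defaultdict(set)
    for set_name, members in sets_by_name.items():
        for gene in members:
            gene_to_sets[gene].add(set_name)
    return gene_to_sets


def genes_supported_by(sets_by_name, min_sets):
    """Genes that appear in at least `min_sets` distinct gene sets."""
    index = invert(sets_by_name)
    return {g: sorted(s) for g, s in index.items() if len(s) >= min_sets}


convergent = genes_supported_by(gene_sets, min_sets=2)
```

Because every data type is reduced to set membership, a GWAS result, a QTL interval, and a differential expression list can be intersected in the same structure, at the cost of discarding effect sizes and directions.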
Fundamental integration through knowledge graphs may also be applied to large-scale heterogeneous analysis. KnowEng [73] uses a knowledge network to navigate the integration of statistical experimental data and contextualized user information to identify human and mouse interactions. Aggregated knowledge networks can be analyzed using traditional network mining approaches or machine learning. Other tools, such as HumanBase [74] or the DIAMOnD [75] algorithm, also take advantage of traversing large ad hoc networks of functional connectivity. Networks are navigated through machine learning or association matrices to connect multispecies gene or variant relationships.
There are many approaches to cross-species comparative genomics and phenomics integration (e.g., Table 2) and analysis must optimize among competing needs of computing scalability, data accessibility, and data scope. For example, the sheer number of variants in humans and rodents and the unbounded phenotype dimension lead to the problem of phenomenal computational scale. The tremendous heterogeneity of model organism datasets, from mutation characterization studies, curated pathway and gene annotation sets, and extensive genetic and genomic data at the level of genes and variants, presents a problem of size, scope, and complexity, in the realm of big data problems, requiring computationally scalable solutions.
Big data and the integration of human and model organism studies in psychiatric genetics
Cross-species analysis typically happens at the level of abstracted relations among variants or genes and can thus be quite reduced in scale. However, (1) the scope of genomic studies is completely unbounded and it is possible to find hundreds, if not thousands of animal studies of disease-relevant neurobiology and (2) the parsing and representation of genomic variants from diverse data sources and their mappings onto one another does not scale so easily. Retaining this traceable mapping while allowing integrative and interactive analysis is a problem of high complexity and scale. The storage, analysis, distribution, and integration of human and model organism functional genomic data are especially challenging, as they embody typical problems encountered in the big data world [76] often referred to as the four V’s of data—volume, variety, velocity, and veracity.
The sheer volume of data required to support comprehensive cross-species data integration of genes and individual variants is staggering. For example, if we assume that the average number of coding genes in mammalian genomes is ~25,000, then constructing rudimentary connections among the genes in five species would produce n(n − 1)/2 relationships, where n is the total number of genes in the network. If represented as a graph, with each edge representing a relationship, the graph would be enormous but tractable, comprising ~7.8E9 edges. But the genome is only one dimension of the problem. The other is the sheer number of contexts in which that genome is experimentally profiled. With thousands of human and model organism addiction genomics datasets, and hundreds of thousands of species-specific pathway data, brain regional transcriptomes, and other relevant data resources, one quickly reaches a problem requiring scalable solutions. Analysis of a handful of organisms can therefore be handled with large, conventional high-performance computing systems. At the variant level, however, the relationship problem is greatly compounded. Known variants, which outnumber genes within the typical model organisms by more than 20,000 to 1, would naively require ~1.25E17 edge relationships. While intelligent approaches for computing on large graphs, such as taking advantage of partitioning [77], sparse connectivity [78], or heuristics [79], can aid in the management and analysis of these relationships, exhaustive examination of static graphs of this potential size is intractable due to limitations in computing, storage, and real-time accessibility. As the number of genomic experiments continues to grow, particularly in the model organism space, one viable option may be the dynamic analysis of datasets using elastic on-demand cloud services that make use of horizontally scalable computing to efficiently distribute computing tasks to address very specific questions.
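The edge counts quoted above follow from the number of unordered pairs among n nodes. The short calculation below reproduces them; the reading of the variant-level figure as n = 25,000 genes × 20,000 variants per gene is an assumption made here to match the quoted ~1.25E17, not something the text states explicitly.

```python
def pairwise_edges(n):
    """Number of unordered pairs among n nodes: n(n - 1)/2."""
    return n * (n - 1) // 2


genes_per_species = 25_000
species = 5

# Gene level: all genes from five species in one network.
gene_nodes = genes_per_species * species
gene_edges = pairwise_edges(gene_nodes)        # about 7.8E9

# Variant level (assumed reading): 20,000 variants per gene, 25,000 genes.
variants_per_gene = 20_000
variant_nodes = genes_per_species * variants_per_gene
variant_edges = pairwise_edges(variant_nodes)  # about 1.25E17

# Rough storage for a dense edge list at 8 bytes per edge, in gigabytes.
gene_edge_gb = gene_edges * 8 / 1e9
```

At 8 bytes per edge the gene-level graph needs on the order of tens of gigabytes, which fits a large server, whereas the variant-level graph would need roughly an exabyte. If variants from all five species were pooled, n would be five times larger and the count about 25 times larger still.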
A corollary to the volume/variety of data associated with variant mapping across species is the velocity at which it is produced, and, subsequently, the rate at which it must be collated, curated, and made accessible. With over 4500 eukaryotic genomes assembled over the last decade [80], it has been argued that genome-scale data will be bigger than Big Data associated with astronomy, YouTube, and Twitter by 2025 [76]. Complicating the processes used to integrate the vast scope of data are data sharing policies that historically do not require automated sharing of model organism data, resulting in data analysis processes that arise primarily from ad hoc relationships [81]. To mitigate the stresses imposed by data velocity, it is critical to devise a means to access, integrate, and dynamically update these data in a manner that avoids redundancies and keeps data provenance intact. While it is inevitable that there will be an uneven integration of data from a variety of sources, it is incumbent on the bioinformatics community to create systems to rapidly track intentional methodologies for data cleaning and reduction through the discovery of duplicated or deprecated data.
By addressing these problems in Big Data, scalable applications in integrative functional genomics for psychiatric genomics are enabled (Fig. 2). The integrated, global mapping of trait regulatory variants across species through target genes can facilitate the integration of model organism genomic data to fill the mechanistic knowledge gap between noncoding human genetic variant and human disease. This integration can be accomplished through the aggregation of curated and high-throughput experimental data from multiple domain-specific resources. Data resources such as GTEx [29], ENCODE [82], and Roadmap Epigenomics [83] provide extensive coverage of genomic regulatory features and gene-regulatory mechanisms. High-level regulatory features including CTCF binding sites, enhancers, open chromatin, promoter, promoter flanking, and transcription factor binding site attributes can all be retrieved from regulation databases [84]. These features can be annotated to genomic variants from the Ensembl variation database [85], for example, to identify regulatory variants within regions of interest. Identifying putative regulatory interactions between regulatory variants and genes can be accomplished through layering several approaches. Topologically associated domains, verified from Hi-C studies and integrated from published studies and the ENCODE resource, can be used to delineate putative gene-regulatory boundaries and all combinations of regulatory variants and genes that are associated within the boundary. Experimentally confirmed feature-gene interactions mediated by RNA polymerase II and identified using ChIA-PET studies, sourced from ENCODE and various publications, can also be used. Finally, eQTLs can identify variant influences on specific genes.
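The layering of evidence described above (shared topologically associated domain, ChIA-PET interaction, eQTL) can be sketched as a function that collects, for each candidate gene, the evidence types linking it to a variant. Coordinates, identifiers, and evidence tables below are hypothetical, and real data would be loaded from the resources named in the text rather than hard-coded.

```python
# Hypothetical topologically associated domains: (chromosome, start, end), end exclusive.
tads = [("chr1", 0, 500_000), ("chr1", 500_000, 1_200_000)]

# Hypothetical gene transcription start sites and variant positions.
gene_tss = {
    "GENE_A": ("chr1", 120_000),
    "GENE_B": ("chr1", 450_000),
    "GENE_C": ("chr1", 700_000),
}
variants = {"rs_1": ("chr1", 300_000)}

# Hypothetical experimental links between variants and genes.
chia_pet_links = {("rs_1", "GENE_B")}
eqtl_links = {("rs_1", "GENE_B"), ("rs_1", "GENE_C")}


def tad_index(chrom, pos):
    """Index of the TAD containing a position, or None if it lies outside all TADs."""
    for i, (tad_chrom, start, end) in enumerate(tads):
        if tad_chrom == chrom and start <= pos < end:
            return i
    return None


def candidate_links(variant):
    """Map each gene to the list of evidence types connecting it to the variant."""
    chrom, pos = variants[variant]
    domain = tad_index(chrom, pos)
    links = {}
    for gene, (gene_chrom, tss) in gene_tss.items():
        evidence = []
        if domain is not None and tad_index(gene_chrom, tss) == domain:
            evidence.append("same_TAD")
        if (variant, gene) in chia_pet_links:
            evidence.append("ChIA-PET")
        if (variant, gene) in eqtl_links:
            evidence.append("eQTL")
        if evidence:
            links[gene] = evidence
    return links


links_rs1 = candidate_links("rs_1")
```

Keeping the evidence types separate, rather than collapsing them into a single yes or no, lets downstream analyses weight a gene supported by all three layers differently from one supported by domain co-location alone.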
Compounding the issues raised by the complexity of raw data is the potential for underlying data bias and the consequent difficulty of attributing veracity to the data. There is an implicit bias in the sampling of genes represented in an experimentally derived genomic data set because each genomic technology, and especially each curated genomic data resource, covers a different breadth, e.g., individual mutation studies curated from the literature by the model organism databases vs. genome-wide gene expression measured by RNA-seq. Differing approaches affect the rate of false positives in the data set. For example, QTL positional candidate sets may contain many genes of which likely only one or a few are true positives, in contrast to differential gene expression sets, for which the statistical threshold defines a false discovery rate. Semiquantitative or quantitative scores for these datasets need to be created to reduce our reliance on qualitative scoring. Enrichment analyses and systems genetic correlation tools suffer from annotation bias in that one often retrieves results representing areas of investigation that are dense with information, resulting in apparent patterns and trends that are an artifact of coverage. Data resources like GTEx also suffer from biases arising from uneven sample sizes and from the particular tissues and conditions investigated. The net effect of the uneven statistical power in these data resources is to upwardly bias well-powered but less relevant findings, in which tissues or phenotypes are spuriously associated with disease. Therefore, it is important to consider not only error-rate controls and related procedures but also the uniformity of analysis across the data being combined.
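To make concrete how a statistical threshold defines a false discovery rate for a gene expression set, the sketch below implements the standard Benjamini-Hochberg step-up procedure in plain Python. The p values are invented for illustration, and the choice of this particular procedure is an assumption; the studies discussed here may have used other error-rate controls.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking the hypotheses rejected at false discovery rate alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p value satisfies p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Invented p values for six genes in a differential expression comparison.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
calls = benjamini_hochberg(p_values, alpha=0.05)
print(calls)
```

A positional candidate list from a QTL interval has no comparable threshold, which is one reason the two kinds of gene set cannot be treated as equally reliable when they are combined.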
Multitissue eQTL data can be integrated to provide context-specific variant mapping. Derived primarily from the GTEx project or from model organism resources such as GeneNetwork [30], eQTL data from mouse, rat, and human genetics experiments represent a diverse and deep pool of evidence. Single-cell RNA sequencing (scRNA-seq) opens the exciting possibility of investigating eQTLs and gene co-expression in complex, multicellular tissues. For example, scRNA-seq data have been used to create high-fidelity classifications of brain regions based on local variants [86]. Furthermore, scRNA-seq has been used to identify cell-type-specific cis-eQTLs and variant co-expression networks [87]. Gene expression genetics studies in model organisms have tremendous precision with new populations like the Diversity Outbred, which segregates 45 million mouse genetic variants comprising 90% of the known mouse genetic variome [88]. Recombination events in this population are resolved at extremely high precision, mapping population sample sizes are large for an increasing number of brain regions, and the derivation of the population from eight founder strains provides a means of reducing eQTLs to a small handful of regulatory variants at the SNP level [89]. As such, it is possible to identify eQTL variants that may affect one of several gene-regulatory mechanisms targeting a human orthologue, and to assess their effects on mouse phenomics, cellular gene expression, or other endpoints in silico, in vitro, or in vivo. Many of these tools provide browser-based and limited scriptable interfaces and continue to adopt new technologies, but exposing model organism eQTL data to large-scale dynamic tools for graphical integration would be of tremendous utility for the ready interrogation of variant-gene relations.
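The idea of carrying a mouse eQTL over to its human orthologue can be sketched as a simple join. The example below is a hedged illustration in Python: the orthology table, variant identifiers, and gene names are placeholders, and a real analysis would draw orthologue assignments and eQTL calls from curated resources rather than from hand-written dictionaries.

```python
def shared_orthologue_hits(mouse_eqtls, orthologues, human_gwas_genes):
    """Map mouse eQTL target genes to human orthologues and keep GWAS-implicated ones."""
    hits = []
    for variant, mouse_gene in mouse_eqtls:
        # A mouse gene may have zero, one, or several human orthologues.
        for human_gene in orthologues.get(mouse_gene, []):
            if human_gene in human_gwas_genes:
                hits.append((variant, mouse_gene, human_gene))
    return hits


# Placeholder data, invented for illustration only.
mouse_eqtls = [("mm_var_1", "GeneA_mouse"),
               ("mm_var_2", "GeneB_mouse"),
               ("mm_var_3", "GeneC_mouse")]
orthologues = {"GeneA_mouse": ["GENEA"],
               "GeneB_mouse": ["GENEB1", "GENEB2"]}
human_gwas_genes = {"GENEA", "GENEB2"}

hits = shared_orthologue_hits(mouse_eqtls, orthologues, human_gwas_genes)
print(hits)
```

Genes without a known orthologue (here `GeneC_mouse`) simply drop out, which illustrates why the completeness of the orthology mapping limits what cross-species integration can recover.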
Multiple applications are readily possible with integrated data structures
A compelling approach to the prioritization of GWAS variants enabled by Big Data integration is the use of integrated cross-species data to identify and characterize those variants with a known mechanistic role in neurobiological pathways to disease, or to identify human variants with highly specific hypothesized roles in particular cases of disease, such as the widely studied ADH1B in AUDs. Although current applications and analytic implementations do not yet take full advantage of large-scale data resources or of cross-species comparative analyses at scale, the emerging scale of data and high-volume comparative analyses will most certainly merit scalable approaches in the near future. Initial applications have necessarily focused on small, single-locus problems. However, these simple applications are ripe for extrapolation to global questions about the neurobiological mechanisms of addiction. One promising application of multispecies epigenomic integration is comparative gene regulation. Now that characterization of gene-regulatory components (e.g., enhancers, TF binding sites) and their putative gene targets is improving, integrative methods can identify shared genomic regulators across species. In one example, from studies on alcohol dependence and cholesterol, at least 4000 SNPs from human GWAS can be mapped onto the mouse genome [90]. Furthermore, some of these SNPs, which are involved in human liver function, can be mapped to liver-specific enhancers in mice [91]. This type of comparative analysis could be used to identify convergent regulatory features and variants across species, enabling the development of mouse models for testing SNP causality in humans. Integrative systems have successfully been used to identify disease-relevant genes and gene-regulatory SNPs that are involved in alcohol preference and withdrawal in mice and that act through epigenetic regulation at a distal enhancer element [92]. Query of public genetic data resources indicates that variation in the same gene occurs in humans via a promoter variant rather than an enhancer [93].
Several recent approaches have been developed for the prioritization of disease-relevant genes and variants from integrative omics analysis. These tools couple large integration pipelines to network and statistical tools to establish the relative importance (e.g., priority indexing) of variants across tissues of interest, focusing on immune-mediated traits. For example, Wang et al. developed a risk gene selection method called iRIGS [18], which incorporates GWAS and a number of genomic features, including expression, chromatin interactions, and gene-regulatory data, into a Bayesian framework for prioritization. This framework prioritizes genes within a 2 Mb region near risk loci identified from GWAS using a select set of epigenetic features, including promoters, enhancers, and chromatin interactions from Hi-C studies. A similar approach, developed by Fang et al., uses a priority index (Pi) pipeline [94] designed to prioritize genes from GWAS variants for specific immune traits. Pi combines genomic predictors in the form of gene proximities, chromatin interactions, and expression modulation evidence (eQTLs) with network-based models to prioritize trait–gene associations. To date, these approaches have not been applied to model organism data, but they readily could be. Furthermore, with the implementation of cross-species variant mapping approaches such as those presented in Fig. 1, they could exploit the broad, heterogeneous multispecies data corpus.
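As a rough illustration of what combining several lines of genomic evidence into a single ranking can look like, the sketch below scores each gene by a weighted sum of binary evidence flags. This is not the iRIGS or Pi algorithm; both rely on considerably richer statistical and network models. The evidence categories, weights, and gene names are assumptions chosen only for the example.

```python
def rank_genes(evidence, weights):
    """Rank genes by a weighted sum of their evidence values, highest score first."""
    scores = {}
    for gene, features in evidence.items():
        scores[gene] = sum(weights[name] * value for name, value in features.items())
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weights: eQTL support counts most, proximity alone counts least.
weights = {"proximity": 1.0, "chromatin_interaction": 2.0, "eqtl": 3.0}

# Hypothetical evidence flags (1 = evidence present, 0 = absent) for three genes.
evidence = {
    "gene_x": {"proximity": 1, "chromatin_interaction": 0, "eqtl": 0},
    "gene_y": {"proximity": 0, "chromatin_interaction": 1, "eqtl": 1},
    "gene_z": {"proximity": 1, "chromatin_interaction": 1, "eqtl": 0},
}

ranking = rank_genes(evidence, weights)
print(ranking)
```

In this toy case the gene nearest the variant ranks last because it lacks functional support, which is the kind of reordering that motivates integrative prioritization over proximity alone. Cross-species evidence could in principle be added as further feature columns.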
Another application is to compare sets of trait-associated human and model organism genomic data to identify similarly regulated disease-relevant traits suitable for convergent validation experiments. Mapping human disease-related characteristics onto model organism behaviors has been a controversial area of research, and for many, the perceived relevance of animal models is hindered by ever-refined definitions of face validity [95]. This objection misses the point that a model is by definition a simplification of a system that renders it amenable to particular types of study, including validation. Animal models themselves have successfully been used to measure the efficacy of drugs and to validate various drug targets [51]. Further, there may be sufficient consilience between human disease traits (such as the various aspects of alcohol use disorders) and those modeled in animals (e.g., ethanol intake) at a genomic level (rg = 0.77 between problem drinking and typical alcohol intake [3]) to allow for careful cross-species data integration for these subfacets of human disease. Research targeting behavioral mechanisms that do converge across species does not discount or diminish the need to study the remaining complexity in the human phenotype. Rather, it serves as a powerful means of discovering the nature of vulnerability and resilience to those components of psychiatric disorders that, in their many manifestations and potentially relevant classifications, are amenable to biological insights and are thus promising targets for therapeutic discovery.
Finally, the prioritization of variants for use in polygenic risk analysis can be refined. Integrative methods can be combined to arrive at sets of variants, drawn from a broad network of genes, that meaningfully contribute to trait variation. Aggregating across the tools and databases listed in the current review will help researchers to match (1) variants to genes, (2) genes to biological functions, (3) functions to plausible molecular mechanisms, and (4) traits and disease characteristics within and across species, ultimately achieving more robust effects with high signal-to-noise ratios. A few studies have constructed polygenic scores from variants in genes known for disease pathology or from targets co-expressed with putative trait genes in relevant brain tissues (via GeneNetwork); both kinds of score predicted better than random sets of genes and achieved trait specificity in mice [96] and humans [97]. However, not all biologically informed polygenic scores exhibit significant prediction [98], and these methods have not been benchmarked against classical approaches that select variants by a specific statistical criterion (e.g., p value threshold; PRSice [99]) or against approaches that combine statistical and alternative biological information (e.g., LD; LDpred [100], PleioPred [101], AnnoPred [21]). A mixture of these techniques is likely required to best inform gene and variant prioritization in human GWAS.
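The contrast between a statistically selected and a biologically informed polygenic score can be shown with a toy calculation. The Python sketch below assumes the usual additive form, in which a score is the sum of effect size times allele dosage over the selected variants. All effect sizes, p values, gene assignments, and dosages are invented, and the selection rules are simplified stand-ins for what tools such as PRSice or gene-set-based methods actually do.

```python
def polygenic_score(dosages, effects, include):
    """Sum effect size times allele dosage over the variants listed in `include`."""
    return sum(effects[v] * dosages[v] for v in include if v in dosages)


# Invented summary statistics for four variants.
effects = {"var_a": 0.5, "var_b": -0.25, "var_c": 0.125, "var_d": 0.75}
p_values = {"var_a": 1e-9, "var_b": 0.2, "var_c": 1e-4, "var_d": 0.03}
variant_gene = {"var_a": "gene_x", "var_b": "gene_y",
                "var_c": "gene_y", "var_d": "gene_z"}

# Invented allele dosages (0, 1, or 2 copies) for one individual.
dosages = {"var_a": 2, "var_b": 1, "var_c": 0, "var_d": 1}

# Selection 1: keep variants passing a p value threshold.
threshold_set = [v for v in effects if p_values[v] < 0.05]

# Selection 2: keep variants mapped to a hypothetical biologically informed gene set.
informed_genes = {"gene_y", "gene_z"}
informed_set = [v for v in effects if variant_gene[v] in informed_genes]

score_threshold = polygenic_score(dosages, effects, threshold_set)
score_informed = polygenic_score(dosages, effects, informed_set)
print(score_threshold, score_informed)
```

Because the two selections retain different variants, the same individual receives different scores, which is why benchmarking biologically informed scores against threshold-based ones on a common data set matters.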
Future research directions
The multiple strategies we have outlined can be used to address the challenges and opportunities for the integration of diverse model organism datasets to augment the interpretation of GWAS and define genes and molecular pathways that underlie aspects of psychiatrically relevant phenotypes. Heterogeneous functional genomics leverages the combined information in population genetic diversity, systems biology, gene-regulatory analysis, and advanced phenotypic measurements to identify and characterize mechanisms of psychiatric disorders of the greatest complexity. Much work remains to facilitate dynamic data integration across these data types. The continued generation of adequately powered and broadly unbiased data resources in neurogenomics is essential across multiple species. Data sharing policies and practices, along with platforms for data sharing and data integration, are required. Community standards and practices that make data findable, accessible, interoperable, and reproducible need to be adopted and resourced so that all researchers engaged in the generation and analysis of integrative functional genomics data are able to contribute to and benefit from data integration. Development of analytic approaches and algorithms is also required for diverse applications in functional genomic data integration. Scalable computational solutions that allow for such high-dimensional data integration will enable a growing array of tools and approaches for the discovery of unknown mechanisms underlying psychiatric disorders, providing a more complete understanding of disease mechanisms.
Funding and disclosure
AA and ECJ received support from NIH MH109532; DA32573 (AA); F32AA027435 (ECJ). EJC, TR, EJB, and JAB received support from NIH AA018776; and EJC from P50 DA039841. RHCP, JAB, and SH received support from NIH DA042103. The authors declare no competing interests.
References
Smeland OB, Frei O, Fan C-C, Shadrin A, Dale AM, Andreassen OA. The emerging pattern of shared polygenic architecture of psychiatric disorders, conceptual and methodological challenges. Psychiatr Genet. 2019;29:152–9. https://doi.org/10.1097/YPG.0000000000000234.
Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 2014;511:421–7. https://doi.org/10.1038/nature13595.
Zhou H, Sealock JM, Sanchez-Roige S, Clarke T-K, Levey D, Cheng Z, et al. Meta-analysis of problematic alcohol use in 435,563 individuals identifies 29 risk variants and yields insights into biology, pleiotropy and causality. bioRxiv. 2019. https://doi.org/10.1101/738088.
Howard DM, Adams MJ, Clarke T-K, Hafferty JD, Gibson J, Shirali M, et al. Genome-wide meta-analysis of depression identifies 102 independent variants and highlights the importance of the prefrontal brain regions. Nat Neurosci. 2019;22. https://doi.org/10.1038/s41593-018-0326-7.
Cross-Disorder Group of the Psychiatric Genomics Consortium. Genomic relationships, novel loci, and pleiotropic mechanisms across eight psychiatric disorders. Cell. 2019;179:1469–82.e11. https://doi.org/10.1016/j.cell.2019.11.020.
Liu M, Jiang Y, Wedow R, Li Y, Brazel DM, Chen F, et al. Association studies of up to 1.2 million individuals yield new insights into the genetic etiology of tobacco and alcohol use. Nat Genet. 2019;51:237–44. https://doi.org/10.1038/s41588-018-0307-5.
Sanchez-Roige S, Fontanillas P, Elson SL, Gray JC, de Wit H, MacKillop J, et al. Genome-wide association studies of impulsive personality traits (BIS-11 and UPPS-P) and drug experimentation in up to 22,861 adult research participants identify loci in the CACNA1I and CADM2 genes. J Neurosci. 2019;39:2562–72. https://doi.org/10.1523/JNEUROSCI.2662-18.2019.
Kranzler HR, Zhou H, Kember RL, Vickers Smith R, Justice AC, Damrauer S, et al. Genome-wide association study of alcohol consumption and use disorder in 274,424 individuals from multiple populations. Nat Commun. 2019;10:1499. https://doi.org/10.1038/s41467-019-09480-8.
Gallagher MD, Chen-Plotkin AS. The post-GWAS era: from association to function. Am J Hum Genet. 2018;102:717–30. https://doi.org/10.1016/j.ajhg.2018.04.002.
Sullivan PF, Geschwind DH. Defining the genetic, genomic, cellular, and diagnostic architectures of psychiatric disorders. Cell. 2019;177:162–83. https://doi.org/10.1016/j.cell.2019.01.015.
Wang M, Xu S. Statistical power in genome-wide association studies and quantitative trait locus mapping. Heredity. 2019;123:287–306. https://doi.org/10.1038/s41437-019-0205-3.
Dudbridge F. Power and predictive accuracy of polygenic risk scores. PLoS Genet. 2013;9:e1003348. https://doi.org/10.1371/journal.pgen.1003348.
Wainschtein P, Jain DP, Yengo L, Zheng Z, Cupples LA, Shadyab AH, et al. Recovery of trait heritability from whole genome sequence data. 2019. https://doi.org/10.1101/588020.
Caspi A, Houts RM, Belsky DW, Goldman-Mellor SJ, Harrington H, Israel S, et al. The p factor: one general psychopathology factor in the structure of psychiatric disorders? Clin Psychol Sci. 2014;2:119–37. https://doi.org/10.1177/2167702613497473.
Watanabe K, Taskesen E, van Bochoven A, Posthuma D. Functional mapping and annotation of genetic associations with FUMA. Nat Commun. 2017;8:1826–8. https://doi.org/10.1038/s41467-017-01261-5.
Ning S, Zhao Z, Ye J, Wang P, Zhi H, Li R, et al. LincSNP: a database of linking disease-associated SNPs to human large intergenic non-coding RNAs. BMC Bioinform. 2014;15:152. https://doi.org/10.1186/1471-2105-15-152.
Wu Y, Zeng J, Zhang F, Zhu Z, Qi T, Zheng Z, et al. Integrative analysis of omics summary data reveals putative mechanisms underlying complex traits. Nat Commun. 2018;9. https://doi.org/10.1038/s41467-018-03371-0.
Wang Q, Chen R, Cheng F, Wei Q, Ji Y, Yang H, et al. A Bayesian framework that integrates multi-omics data and gene networks predicts risk genes from schizophrenia GWAS data. Nat Neurosci. 2019;22:691–9. https://doi.org/10.1038/s41593-019-0382-7.
Kichaev G, Yang WY, Lindstrom S, Hormozdiari F, Eskin E, Price AL, et al. Integrating functional data to prioritize causal variants in statistical fine-mapping studies. PLoS Genet. 2014;10:e1004722. https://doi.org/10.1371/journal.pgen.1004722.
Mancuso N, Freund MK, Johnson R, Shi H, Kichaev G, Gusev A, et al. Probabilistic fine-mapping of transcriptome-wide association studies. Nat Genet. 2019;51. https://doi.org/10.1038/s41588-019-0367-1.
Hu Y, Lu Q, Powles R, Yao X, Yang C, Fang F, et al. Leveraging functional annotations in genetic risk prediction for human complex diseases. PLoS Comput Biol. 2017;13:e1005589. https://doi.org/10.1371/journal.pcbi.100558.
Márquez-Luna C, Gazal S, Loh PR, Kim SS, Furlotte N, Auton A, et al. LDpred-funct: incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets. 2020. https://doi.org/10.1101/375337.
Subramanian I, Verma S, Kumar S, Jere A, Anamika K. Multi-omics data integration, interpretation, and its application. Bioinform Biol Insights. 2020;14. https://doi.org/10.1177/1177932219899051.
Kichaev G, Bhatia G, Loh PR, Gazal S, Burch K, Freund MK, et al. Leveraging polygenic functional enrichment to improve GWAS power. Am J Hum Genet. 2019;104:65–75. https://doi.org/10.1016/j.ajhg.2018.11.008.
Palmer RHC, Benca-Bachman CE, Bubier JA, McGeary JE, Ramgiri N, Srijeyanthan J, et al. Cross-species integration of transcriptomic effects of tobacco and nicotine exposure helps to prioritize genetic effects on human tobacco consumption. 2019. https://doi.org/10.1101/2019.12.23.887083.
Mignogna KM, Bacanu SA, Riley BP, Wolen AR, Miles MF. Cross-species alcohol dependence-associated gene networks: co-analysis of mouse brain gene expression and human genome-wide association data. PLoS ONE. 2019;14:e020206. https://doi.org/10.1371/journal.pone.0202063.
Chesler EJ, Lu L, Shou S, Qu Y, Gu J, Wang J, et al. Complex trait analysis of gene expression uncovers polygenic and pleiotropic networks that modulate nervous system function. Nat Genet. 2005;37:233–42. https://doi.org/10.1038/ng1518.
Crabbe JC. Progress with nonhuman animal models of addiction. J Stud Alcohol Drugs. 2016;77:696–9. https://doi.org/10.15288/jsad.2016.77.696.
GTEx Consortium. The genotype-tissue expression (GTEx) project. Nat Genet. 2013;45:580–5. https://doi.org/10.1038/ng.2653.
Mulligan MK, Mozhui K, Prins P, Williams RW. GeneNetwork: a toolbox for systems genetics. Methods Mol Biol. 2017;1488:75–120. https://doi.org/10.1007/978-1-4939-6427-7_4.
PsychENCODE Consortium, Akbarian S, Liu C, Knowles JA, Vaccarino FM, Farnham PJ, et al. The PsychENCODE project. Nat Neurosci. 2015;18:1707–12. https://doi.org/10.1038/nn.4156.
Celniker SE, Dillon LAL, Gerstein MB, Gunsalus KC, Henikoff S, Karpen GH, et al. Unlocking the secrets of the genome. Nature. 2009;459:927–30. https://doi.org/10.1038/459927a.
Edgar R, Domrachev M, Lash AE. Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucl Acids Res. 2002;30:207–10. https://doi.org/10.1093/nar/30.1.207.
Dickinson ME, Flenniken AM, Ji X, Teboul L, Wong MD, White JK, et al. High-throughput discovery of novel developmental phenotypes. Nature. 2016;537:508–14. https://doi.org/10.1038/nature19356.
Durrant C, Swertz MA, Alberts R, Arends D, Möller S, Mott R, et al. Bioinformatics tools and database resources for systems genetics analysis in mice—a short review and an evaluation of future needs. Brief Bioinform. 2012;13:135–42. https://doi.org/10.1093/bib/bbr026.
Bogue MA, Philip VM, Walton DO, Grubb SC, Dunn MH, Kolishovski G, et al. Mouse Phenome Database: a data repository and analysis suite for curated primary mouse phenotype data. Nucleic Acids Res. 2020;48:D716–23. https://doi.org/10.1093/nar/gkz1032.
Bogue MA, Grubb SC, Walton DO, Philip VM, Kolishovski G, Stearns T, et al. Mouse Phenome Database: an integrative database and analysis suite for curated empirical phenotype data from laboratory mice. Nucleic Acids Res. 2018;46:D843–50. https://doi.org/10.1093/nar/gkx1082.
Nishiguchi M, Kinoshita H, Mostofa J, Taniguchi T, Ouchi H, Minami T, et al. Different blood acetaldehyde concentration following ethanol administration in a newly developed high alcohol preference and low alcohol preference rat model system. Alcohol Alcohol. 2002;37:9–12. https://doi.org/10.1093/alcalc/37.1.9.
Oberlin B, Best C, Matson L, Henderson A, Grahame N. Derivation and characterization of replicate high- and low-alcohol preferring lines of mice and a high-drinking crossed HAP line. Behav Genet. 2011;41:288–302. https://doi.org/10.1007/s10519-010-9394-5.
Bergeson SE, Kyle Warren R, Crabbe JC, Metten P, Gene Erwin V, Belknap JK. Chromosomal loci influencing chronic alcohol withdrawal severity. Mamm Genome. 2003;14:454–63. https://doi.org/10.1007/s00335-002-2254-4.
Adkins AE, Hack LM, Bigdeli TB, Williamson VS, McMichael GO, Mamdani M, et al. Genomewide association study of alcohol dependence identifies risk loci altering ethanol-response behaviors in model organisms. Alcohol Clin Exp Res. 2017;41:911–28. https://doi.org/10.1111/acer.13362.
Schumann G, Liu C, O’Reilly P, Gao H, Song P, Xu B, et al. KLB is associated with alcohol drinking, and its gene product β-Klotho is necessary for FGF21 regulation of alcohol preference. Proc Natl Acad Sci USA. 2016;113:14372–7. https://doi.org/10.1073/pnas.1611243113.
Sekar A, Bialas AR, de Rivera H, Davis A, Hammond TR, Kamitaki N, et al. Schizophrenia risk from complex variation of complement component 4. Nature. 2016;530:177–83. https://doi.org/10.1038/nature16549.
Rangaraju S, Dammer EB, Raza SA, Rathakrishnan P, Xiao H, Gao T, et al. Identification and therapeutic modulation of a pro-inflammatory subset of disease-associated-microglia in Alzheimer’s disease. Mol Neurodegener. 2018;13:24. https://doi.org/10.1186/s13024-018-0254-8.
Wolen AR, Phillips CA, Langston MA, Putman AH, Vorster PJ, Bruce NA, et al. Genetic dissection of acute ethanol responsive gene networks in prefrontal cortex: functional and mechanistic implications. PLoS ONE. 2012;7:e33575. https://doi.org/10.1371/journal.pone.0033575.
Turley P, Walters RK, Maghzian O, Okbay A, Lee JJ, Fontana MA, et al. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat Genet. 2018;50:229–37. https://doi.org/10.1038/s41588-017-0009-4.
Sanchez-Roige S, Palmer AA, Fontanillas P, Elson SL, 23andMe Research Team, the Substance Use Disorder Working Group of the Psychiatric Genomics Consortium, Adams MJ, et al. Genome-wide association study meta-analysis of the alcohol use disorders identification test (AUDIT) in two population-based cohorts. Am J Psychiatry. 2019;176:107–18. https://doi.org/10.1176/appi.ajp.2018.18040369.
Crabbe JC. Translational behaviour-genetic studies of alcohol: are we there yet? Genes Brain Behav. 2012;11:375–86. https://doi.org/10.1111/j.1601-183X.2012.00798.x.
Zhang H-L, Long J-W, Han W, Wang J, Song W, Lin GN, et al. Comparative analysis of cellular expression pattern of schizophrenia risk genes in human versus mouse cortex. Cell Biosci. 2019;9:89. https://doi.org/10.1186/s13578-019-0352-5.
van den Heuvel MP, Scholtens LH, de Lange SC, Pijnenburg R, Cahn W, van Haren NEM, et al. Evolutionary modifications in human brain connectivity associated with schizophrenia. Brain. 2019;142:3991–4002. https://doi.org/10.1093/brain/awz330.
Krishnan V, Nestler EJ. Animal models of depression: molecular perspectives. Curr Top Behav Neurosci. 2011;7:121–47. https://doi.org/10.1007/7854_2010_108.
Baker EJ, Jay JJ, Philip VM, Zhang Y, Li Z, Kirova R, et al. Ontological discovery environment: a system for integrating gene-phenotype associations. Genomics. 2009;94:377–87. https://doi.org/10.1016/j.ygeno.2009.08.016.
McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GRS, Thormann A, et al. The Ensembl variant effect predictor. Genome Biol. 2016;17:122. https://doi.org/10.1186/s13059-016-0974-4.
Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38:e164. https://doi.org/10.1093/nar/gkq603.
Brodie A, Azaria JR, Ofran Y. How far from the SNP may the causative genes be? Nucleic Acids Res. 2016;44:6046–54. https://doi.org/10.1093/nar/gkw500.
de Leeuw CA, Mooij JM, Heskes T, Posthuma D. MAGMA: generalized gene-set analysis of GWAS data. PLoS Comput Biol. 2015;11:e1004219. https://doi.org/10.1371/journal.pcbi.1004219.
Barbeira AN, Dickinson SP, Bonazzola R, Zheng J, Wheeler HE, Torres JM, et al. Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat Commun. 2018;9. https://doi.org/10.1038/s41467-018-03621-1.
Gusev A, Ko A, Shi H, Bhatia G, Chung W, Penninx BWJH, et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat Genet. 2016;48:245–52. https://doi.org/10.1038/ng.3506.
Sey NYA, Fauni H, Ma W, Won H. Connecting gene regulatory relationships to neurobiological mechanisms of brain disorders. 2019. https://doi.org/10.1101/681353.
Alliance of Genome Resources Consortium. Alliance of genome resources portal: unified model organism research platform. Nucleic Acids Res. 2020;48:D650–8. https://doi.org/10.1093/nar/gkz813.
Sonnhammer ELL, Östlund G. InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic. Nucleic Acids Res. 2015;43:D234–9. https://doi.org/10.1093/nar/gku1203.
Marigorta UM, Rodríguez JA, Gibson G, Navarro A. Replicability and prediction: lessons and challenges from GWAS. Trends Genet. 2018;34:504–17. https://doi.org/10.1016/j.tig.2018.03.005.
Washington NL, Haendel MA, Mungall CJ, Ashburner M, Westerfield M, Lewis SE. Linking human diseases to animal models using ontology-based phenotype annotation. PLoS Biol. 2009;7:e1000247. https://doi.org/10.1371/journal.pbio.1000247.
Mungall CJ, McMurry JA, Köhler S, Balhoff JP, Borromeo C, Brush M, et al. The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Res. 2017;45:D712–22. https://doi.org/10.1093/nar/gkw1128.
Smedley D, et al. PhenoDigm: analyzing curated annotations to associate animal models with human diseases. Database. 2013;2013:bat025. https://doi.org/10.1093/database/bat025.
Bone WP, et al. Computational evaluation of exome sequence data using human and model organism phenotypes improves diagnostic efficiency. Genet Med. 2016;18:608–17. https://doi.org/10.1038/gim.2015.137.
Robinson PN, Köhler S, Oellrich A, Sanger Mouse Genetics Project, Wang K, Mungall CJ, et al. Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome Res. 2014;24:340–8. https://doi.org/10.1101/gr.160325.113.
Bubier JA, Phillips CA, Langston MA, Baker EJ, Chesler EJ. GeneWeaver: finding consilience in heterogeneous cross-species functional genomics data. Mamm Genome. 2015;26:556–66. https://doi.org/10.1007/s00335-015-9575-x.
Baker E, Bubier JA, Reynolds T, Langston MA, Chesler EJ. GeneWeaver: data driven alignment of cross-species genomics in biology and disease. Nucleic Acids Res. 2016;44:D555–9. https://doi.org/10.1093/nar/gkv1329.
Baker EJ, Jay JJ, Bubier JA, Langston MA, Chesler EJ. GeneWeaver: a web-based system for integrative functional genomics. Nucleic Acids Res. 2012;40:D1067–76. https://doi.org/10.1093/nar/gkr968.
Reynolds T, Bubier JA, Langston MA, Chesler EJ, Baker EJ. Finding human gene-disease associations using a Network Enhanced Similarity Search (NESS) of multi-species heterogeneous functional genomics data. https://doi.org/10.1101/2020.03.11.987552.
Rouillard AD, Gundersen GW, Fernandez NF, Wang Z, Monteiro CD, McDermott MG, et al. The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins. Database. 2016;2016. https://doi.org/10.1093/database/baw100.
Sinha S, Song J, Weinshilboum R, Jongeneel V, Han J. KnowEnG: a knowledge engine for genomics. J Am Med Inf Assoc. 2015;22:1115–9. https://doi.org/10.1093/jamia/ocv090.
Greene CS, Krishnan A, Wong AK, Ricciotti E, Zelaya RA, Himmelstein DS, et al. Understanding multicellular function and disease with human tissue-specific networks. Nat Genet. 2015;47:569–76. https://doi.org/10.1038/ng.3259.
Ghiassian SD, Menche J, Barabási A-L. A DIseAse MOdule Detection (DIAMOnD) algorithm derived from a systematic analysis of connectivity patterns of disease proteins in the human interactome. PLoS Comput Biol. 2015;11:e1004120. https://doi.org/10.1371/journal.pcbi.1004120.
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, et al. Big data: astronomical or genomical? PLoS Biol. 2015;13:e1002195. https://doi.org/10.1371/journal.pbio.1002195.
Boman EG, Devine KD, Rajamanickam S. Scalable matrix computations on large scale-free graphs using 2D graph partitioning. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. Denver, Colorado; 2013. p. 1–12. https://doi.org/10.1145/2503210.2503293.
Latapy M. Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor Computer Sci. 2008;407:458–73. https://doi.org/10.1016/j.tcs.2008.07.017.
Stanton I, Kliot G. Streaming graph partitioning for large distributed graphs. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Beijing, China; 2012. p. 1222–30. https://doi.org/10.1145/2339530.2339722.
Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Ostell J, Pruitt KD, et al. GenBank. Nucleic Acids Res. 2018;46:D41–7. https://doi.org/10.1093/nar/gkx1094.
Field D, Sansone S-A, Collis A, Booth T, Dukes P, Gregurick SK, et al. ’Omics data sharing. Science. 2009;326:234–6. https://doi.org/10.1126/science.1180598.
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. https://doi.org/10.1038/nature11247.
Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, Milosavljevic A, Meissner A, et al. The NIH Roadmap Epigenomics Mapping Consortium. Nat Biotechnol. 2010;28:1045–8. https://doi.org/10.1038/nbt1010-1045.
Zerbino DR, Johnson N, Juetteman T, Sheppard D, Wilder SP, Lavidas I, et al. Ensembl regulation resources. Database. 2016;2016. https://doi.org/10.1093/database/bav119.
Hunt SE, McLaren W, Gil L, Thormann A, Schuilenburg H, Sheppard D, et al. Ensembl variation resources. Database. 2018;2018:01. https://doi.org/10.1093/database/bay119.
Poulin J-F, Tasic B, Hjerling-Leffler J, Trimarchi JM, Awatramani R. Disentangling neural cell diversity using single-cell transcriptomics. Nat Neurosci. 2016;19. https://doi.org/10.1038/nn.4366.
van der Wijst MGP, Brugge H, de Vries DH, Deelen P, Swertz MA, Franke L. Single-cell RNA sequencing identifies celltype-specific cis-eQTLs and co-expression QTLs. Nat Genet. 2018;50. https://doi.org/10.1038/s41588-018-0089-9.
Roberts A, Pardo-Manuel de Villena F, Wang W, McMillan L, Threadgill DW. The polymorphism architecture of mouse genetic resources elucidated using genome-wide resequencing data: implications for QTL discovery and systems genetics. Mamm Genome. 2007;18:473–81. https://doi.org/10.1007/s00335-007-9045-1.
Skelly DA, Raghupathy N, Robledo RF, Graber JH, Chesler EJ. Reference trait analysis reveals correlations between gene expression and quantitative traits in disjoint samples. Genetics. 2019;212:919–29. https://doi.org/10.1534/genetics.118.301865.
Yue F, Cheng Y, Breschi A, Vierstra J, Wu W, Ryba T, et al. A comparative encyclopedia of DNA elements in the mouse genome. Nature. 2014;515:355–64. https://doi.org/10.1038/nature13992.
Breschi A, Gingeras TR, Guigó R. Comparative transcriptomics in human and mouse. Nat Rev Genet. 2017;18:425–40. https://doi.org/10.1038/nrg.2017.19.
Bubier JA, Jay JJ, Baker CL, Bergeson SE, Ohno H, Metten P, et al. Identification of a QTL in Mus musculus for alcohol preference, withdrawal, and Ap3m2 expression using integrative functional genomics and precision. Genetics. 2014;197:1377–93. https://doi.org/10.1534/genetics.114.166165.
GTEx Consortium. Human genomics. the genotype-tissue expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015;348:648–60. https://doi.org/10.1126/science.1262110.
Fang H, ULTRA-DD Consortium, De Wolf H, Knezevic B, Burnham KL, Osgood J, et al. A genetics-led approach defines the drug target landscape of 30 immune-related traits. Nat Genet. 2019;51:1082–91. https://doi.org/10.1038/s41588-019-0456-1.
Nestler EJ, Hyman SE. Animal models of neuropsychiatric disorders. Nat Neurosci. 2010;13:1161–9. https://doi.org/10.1038/nn.2647.
Neuner SM, Heuer SE, Huentelman MJ, O’Connell KMS, Kaczorowski CC. Harnessing genetic complexity to enhance translatability of Alzheimer’s disease mouse models: a path toward precision medicine. Neuron. 2019;101:399–411.e5. https://doi.org/10.1016/j.neuron.2018.11.040.
Hari Dass SA, McCracken K, Pokhvisneva I, Chen LM, Garg E, Nguyen TTT, et al. A biologically-informed polygenic score identifies endophenotypes and clinical conditions associated with the insulin receptor function on specific brain regions. EBioMedicine. 2019;42:188–202. https://doi.org/10.1016/j.ebiom.2019.03.051.
Van der Auwera S, Wittfeld K, Shumskaya E, Bralten J, Zwiers MP, Onnink AMH, et al. Predicting brain structure in population-based samples with biologically informed genetic scores for schizophrenia. Am J Med Genet B Neuropsychiatr Genet. 2017;174:324–32. https://doi.org/10.1002/ajmg.b.32519.
Euesden J, Lewis CM, O’Reilly PF. PRSice: polygenic risk score software. Bioinformatics. 2015;31:1466–8. https://doi.org/10.1093/bioinformatics/btu848.
Vilhjálmsson BJ, Yang J, Finucane HK, Gusev A, Lindström S, Ripke S, et al. Modeling linkage disequilibrium increases accuracy of polygenic risk scores. Am J Hum Genet. 2015;97:576–92. https://doi.org/10.1016/j.ajhg.2015.09.001.
Hu Y, Lu Q, Liu W, Zhang Y, Li M, Zhao H. Joint modeling of genetically correlated diseases and functional annotations increases accuracy of polygenic risk prediction. PLoS Genet. 2017;13:e1006836. https://doi.org/10.1371/journal.pgen.1006836.
Acknowledgements
The authors thank Stephen Krasinski of The Jackson Laboratory for assistance with this manuscript.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Reynolds, T., Johnson, E.C., Huggett, S.B. et al. Interpretation of psychiatric genome-wide association studies with multispecies heterogeneous functional genomic data integration. Neuropsychopharmacol. 46, 86–97 (2021). https://doi.org/10.1038/s41386-020-00795-5
DOI: https://doi.org/10.1038/s41386-020-00795-5