Introduction

Obesity and type 2 diabetes (T2D) are major public health problems, and their rates are increasing. It has been reported that 40% of adults in the UK will have obesity by 2025 [1], and the worldwide population with T2D will approach 600 million in the next 20 years [2]. Understanding the molecular mechanisms of these conditions is important for identifying therapeutic targets, but success in identifying target genes has been limited because these conditions are generally not genetic disorders, aside from rare cases of clear genetic abnormalities such as maturity onset diabetes of the young, Donohue syndrome, or Rabson-Mendenhall syndrome [3]. Another challenge is that, unlike cancer, they are generally not initiated from a single organ. For example, a major mechanism of developing T2D is acquiring insulin resistance, which may involve the accumulation of various environmental factors and multiple organs, such as adipose, liver, and muscle. These characteristics imply that obesity and T2D result from abnormal dynamic states of relevant biological functions rather than aberrations of certain driver genes, which has made it difficult to find simple therapeutic targets. For this reason, medical treatment of obesity or T2D is more about controlling the phenotypes of subjects than about curing the disease by eliminating its drivers or returning the metabolic status to a normal state: for obesity, reducing caloric intake or appetite; for T2D, decreasing blood glucose levels, increasing sensitivity to insulin, increasing insulin secretion, or using insulin therapy.

Considering that obesity and T2D are due to abnormal dynamic states of relevant biological functions, it can be challenging to find therapeutic targets that apply to all subjects. It may be necessary to identify different points of intervention for different subjects, because an abnormality of the same biological function can arise from aberrant molecular activity at multiple points. For this reason, understanding the overall mechanisms and identifying therapeutic candidates for obesity and T2D in the general population requires cohorts large enough to capture variance in metabolic phenotypes and potentially diverse driving mechanisms, along with comprehensive data that represent the exact status of individual subjects, such as detailed phenotypes and multi-omic profiles (for example, genomic, epigenetic, metabolic, and proteomic profiles). However, research communities studying obesity and T2D lack such comprehensive data resources, unlike those studying other diseases such as cancer, for which many comprehensive multi-omic data resources are publicly available.

Even though there are no comprehensive data resources for obesity and T2D, individual studies can each supply certain aspects of a comprehensive data collection. This review discusses currently available genomic data resources that can be used to identify therapeutic candidates for obesity and T2D, including genome-wide association studies (GWAS), a knockout (KO)-based phenotyping study, and gene expression studies that have observed expression changes in subjects with obesity and T2D across relevant organs. The included data sets range from individual studies to large data sets curated by international consortia. Using these data sets with their characteristics in mind can be an alternative approach that mimics comprehensive molecular profiling, and curating customized genomic data sets from them provides a useful reference for studying therapeutic candidates for specific phenotypic conditions.

DNA-level susceptibility to obesity and T2D

Many early approaches to identifying genetic effects on obesity and T2D were GWAS. A GWAS genotypes known or candidate single-nucleotide polymorphisms (SNPs), records phenotypes related to obesity and T2D, and evaluates the statistical association between each SNP and each phenotype. Based on GWAS, it is possible to identify genes that contain or are close to loci associated with susceptibility to the studied phenotypes. Unlike rare cases of diabetes with clear genetic drivers, variants at these susceptibility loci can have subtle effects on the function of relevant genes, as previous studies reported rather modest effect sizes of genetic variants on T2D that range from 10 to 35% [4, 5]. Nevertheless, T2D is known to have a notable genetic basis, as the co-occurrence of T2D is ~70% in monozygotic twins but only 20–30% in dizygotic twins [6]. In normal populations with susceptibility loci, these subtle effects can generate long-term phenotypic differences in conjunction with other non-genetic, often environmental factors.

Table 1 lists selected popular GWAS that assessed phenotypes related to obesity or T2D. Most consortia or studies are based on a collection of cohorts, and some cohorts are included in more than one consortium or study. Phenotype information from the individual cohorts within a consortium or study may not be fully consistent. Thus, the phenotypes listed in Table 1 are those that each consortium or study made an effort to collect and analyze coherently. Some consortia or studies directly analyzed associations with disease outcomes (obesity or T2D; DIAGRAM, InterAct, GoT2D, and T2D-GENES), whereas others studied associations with more detailed phenotypes, such as body measurements or fat composition (EPIC-Norfolk, Fenland, GENESIS [7], GIANT, UK Biobank, and UKHLS), lipid profiles (EPIC-Norfolk, Fenland, GENESIS [7], GLGC [8], InterAct, UK Biobank, and UKHLS), and insulin resistance/sensitivity (Fenland, GENESIS [7], and MAGIC). Individual-level genetic data are rarely available, except from a few resources that accept access applications; thus, it is difficult to collect individual-level raw genetic data from multiple cohorts together with phenotypic information to conduct an association analysis. However, summary statistics of the associations between SNPs and phenotypes are often publicly available, generally including p-values, frequencies in cohorts, and effect sizes, and this information is useful for designing and conducting a meta-analysis of interest.

Table 1 Selected popular GWAS related to obesity or T2D

Table 2 lists selected major GWAS publications that assessed genetic associations with phenotypes relevant to obesity or T2D. The studied phenotypes are listed for each work, but most studies consider additional phenotypes to adjust statistical associations or to prioritize associated variants. Most studies are meta-analyses that use multiple cohorts from several consortia or studies. One general approach of these meta-analyses is to identify novel susceptibility loci by increasing the population size with multiple cohorts, or to provide independent support for newly identified loci by using extra cohorts as independent validation data. Another approach is to systematically integrate the results of multiple GWAS of various phenotypes to model certain types of conditions or diseases. A good example of this type of meta-analysis is the work by Lotta et al. [9], who identified candidate loci associated with lipodystrophy-like phenotypes by integrating the results of several GWAS consortia. Most studies provide a list of identified loci, and some also provide more detailed summary statistics through their related consortia. A novel meta-analysis of GWAS can be designed to study genetic loci susceptible to specific combinations of phenotypes by integrating GWAS summary statistics that were derived from analyzing associations with individual phenotypes.
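
The cohort-pooling style of meta-analysis described above is commonly carried out on summary statistics with an inverse-variance-weighted fixed-effect model. The following is a minimal sketch of that calculation for a single SNP. It assumes that each cohort reports an effect size and standard error on the same scale and for the same effect allele. The function name and the numbers are invented for illustration, and this is not necessarily the procedure used by the studies in Table 2.

```python
import math

def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance-weighted fixed-effect meta-analysis of one SNP.

    betas and standard_errors hold one value per cohort (hypothetical inputs).
    Returns the pooled effect, its standard error, the z-score, and a
    two-sided p-value under a standard normal approximation.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p_value

# Invented summary statistics for one SNP in three cohorts.
beta, se, z, p = fixed_effect_meta([0.10, 0.14, 0.08], [0.04, 0.05, 0.03])
print(f"pooled beta={beta:.4f}, se={se:.4f}, z={z:.2f}, p={p:.2e}")
```

The pooled standard error is always smaller than that of any single cohort, which is how adding cohorts increases the power to detect loci with modest effect sizes.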

Table 2 Selected major GWAS publications related to obesity or T2D

In addition to GWAS-derived data sources from individual consortia or studies, there are online data resources in which previous GWAS results are curated and can be accessed through user-friendly interfaces. The NHGRI-EBI GWAS Catalog [10] provides search and visualization of published SNP-trait associations and bulk download of its contents for systematic analysis. It currently contains 63,205 unique SNP-trait associations from >3200 publications, and it also covers GWAS on phenotypes other than obesity and T2D. The Type 2 Diabetes Knowledge Portal [11] is a T2D-focused online data portal in which 22 GWAS/exome chip/whole genome sequencing/exome sequencing data sets are curated with association information for 47 traits. Its user interfaces can simulate the systematic integration of multiple GWAS with various phenotypes: users can search for variants of interest in individual GWAS data sets from participating consortia and combine the results. However, it does not provide bulk download of entire integrated data sets. Both data portals support searches by disease, gene, phenotype, or variant.

Available GWAS results cover associations with various phenotypes that are related to obesity or T2D, most of which belong to one of four categories: insulin resistance/sensitivity-related phenotypes, lipid profile-related phenotypes, outcome of obesity, and outcome of T2D. To better understand which genes are associated with these phenotypes, genes that have been associated with any of the four phenotype categories were collected from the NHGRI-EBI GWAS Catalog [10]. Specifically, the bulk GWAS result data covering all 63,205 reported SNP-trait associations were obtained, and SNPs that were associated with phenotypes in at least one of the four categories were collected. For each such SNP, the gene that contains the SNP was considered to be associated with the corresponding phenotype; if the SNP was in an intergenic location, the gene closest to the SNP was used instead. Fig. 1 shows a Venn diagram of the 2375 genes that are associated with at least one of the four obesity/T2D-related phenotype categories. Some genes are shared between categories, but each phenotype category also has genes associated exclusively with it. The six genes that show associations with all four categories of phenotypes include the well-known peroxisome proliferator activated receptor gamma (PPARG), which is a regulator of adipocyte differentiation [12] and has been implicated in numerous diseases, including obesity [13] and T2D [14]. Another of these genes is peptidase D (PEPD), which is known to play an important role in collagen metabolism [15].
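
The gene assignment rule and the Venn-style grouping described above can be sketched as follows. This is an illustration only: the record layout, SNP identifiers, category labels, and gene names are hypothetical placeholders, and real GWAS Catalog downloads use different column names that would need to be mapped to this form first.

```python
from collections import defaultdict

# Hypothetical records: (SNP id, phenotype category, containing gene or None, nearest gene).
records = [
    ("rs_a", "T2D", "GENE1", "GENE1"),
    ("rs_b", "obesity", None, "GENE2"),   # intergenic SNP: fall back to nearest gene
    ("rs_c", "lipid", "GENE1", "GENE1"),
    ("rs_d", "insulin", None, "GENE1"),   # intergenic SNP: fall back to nearest gene
    ("rs_e", "obesity", "GENE1", "GENE1"),
    ("rs_f", "T2D", "GENE3", "GENE3"),
]

def genes_by_category(rows):
    """Assign each SNP to its containing gene, or to the nearest gene if intergenic."""
    result = defaultdict(set)
    for _snp, category, containing, nearest in rows:
        gene = containing if containing is not None else nearest
        if gene:
            result[category].add(gene)
    return dict(result)

def venn_regions(sets_by_category):
    """Group genes by the exact combination of categories they appear in."""
    membership = defaultdict(set)
    for category, genes in sets_by_category.items():
        for gene in genes:
            membership[gene].add(category)
    regions = defaultdict(set)
    for gene, categories in membership.items():
        regions[frozenset(categories)].add(gene)
    return dict(regions)

by_cat = genes_by_category(records)
regions = venn_regions(by_cat)
for categories, genes in sorted(regions.items(), key=lambda kv: (-len(kv[0]), sorted(kv[0]))):
    print(sorted(categories), sorted(genes))
```

Each key of `regions` corresponds to one region of a Venn diagram, so the genes associated with all four categories are those stored under the key that contains all four labels.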

Fig. 1 Genes that have been reported to be associated with phenotypes relevant to obesity or T2D, as assessed from the NHGRI-EBI GWAS Catalog [10]

As already mentioned, the direct effect size of GWAS-identified loci on obesity/T2D-related phenotypes is relatively small. It should be noted that the genes related to GWAS-identified loci point to biological functions that play some role in developing metabolic disorders; they are not necessarily decisive disease drivers. For this reason, using genes from GWAS generally requires further direct validation of the mechanisms that drive these metabolic disorders.

Causal gene identification with gene KO mouse models

GWAS takes a passive observational approach that searches for associations between the phenotypes of interest and genetic variants in real populations. For this reason, it is challenging to uncover specific mechanisms of action from the identified susceptibility loci, as they generally explain only marginal effect sizes. In comparison, understanding the function of genes by knocking them out in model species and observing the resulting phenotypes is an extreme interventional approach. In this approach, each gene is knocked out in a model species and the resulting phenotypes are observed according to predefined protocols. A good example of this approach is the International Mouse Phenotyping Consortium (IMPC) [16], whose objective is to produce KO mouse lines for >20,000 known genes and to observe the resulting phenotypes with standardized protocols. It is an international consortium of multiple institutions, and these institutions produce germ line transmissions of targeted KO mutations in embryonic stem cells for known/predicted mouse genes. Each mutant mouse line is tested through a standardized primary phenotyping pipeline (see the website of the consortium for a complete list of studied phenotypes) that covers all major adult organ systems and most areas of major human disease. Briefly, phenotypes are observed from the embryonic stage until the 16th week and include fatality, body measurements and composition, metabolic profiles, insulin-related phenotypes, and pathological, physical, and physiological phenotypes. It is an ongoing project, and the current release (Release 6.1) includes phenotype information from knockouts of 3371 mouse genes. IMPC provides online search functionality for genes, diseases, and phenotypes, and detailed phenotype information is provided, where available, for queried KO models.

Among the phenotypes studied by IMPC, those relevant to obesity or T2D can be grouped into three categories: insulin resistance/sensitivity-related phenotypes, lipid profile-related phenotypes, and obesity-related phenotypes, such as weight changes. Among the 3371 studied IMPC genes, genes that showed statistically significant changes in phenotypes belonging to any of the three categories were assessed from IMPC Release 6.1. Fig. 2 shows the Venn diagram of the 856 genes that caused these statistically significant phenotypic changes, by phenotype category. As with GWAS-identified genes, genes from KO-based phenotyping show some overlap between categories as well as genes unique to each category. There are 30 genes that show changes in all three phenotype categories, and they include previously known genes involved in energy transfer and metabolism. CHN1 is a GTPase-activating protein [17], BNIP2 is related to myogenesis [18] and GTPase activator activity [19], and HBS1L and GIMAP6 [20] are related to GTP binding. NCOA1 is involved in controlling the energy balance between white and brown adipose tissues [21]. CYP17A1 and CYP27B1 are members of the cytochrome P450 superfamily of enzymes [22], which are monooxygenases that catalyze many reactions involved in drug metabolism and the synthesis of cholesterol, steroids, and other lipids. LEPR is a receptor for leptin and is involved in the regulation of fat metabolism [23].

Fig. 2 Genes that showed statistically significant phenotype changes after KO from IMPC (based on Release 6.1)

The advantage of this KO-based phenotyping approach is its direct observation of the phenotypes that result from an individual gene KO, which minimizes the undesirable effects of other factors when analyzing the biological function of the target gene. However, there are a few challenges with this approach. Establishing KO mouse models is itself a challenging task, often requiring significant time and effort. Controlling the quality of the standardized phenotyping protocol can also be a technical obstacle, especially when multiple independent organizations collaborate internationally. There is also an inherent limitation that lethal genes are hard to study with this approach, as KO of these genes prevents the production of adult mouse lines and the subsequent phenotyping. In addition to these challenges, a few characteristics should be noted before using the phenotyping results of gene KO. Current phenotyping protocols focus on identifying phenotypes in normal environments (for example, feeding normal chow); thus, these studies do not capture phenotypic changes that may occur under environmental stresses of interest (for example, a high fat diet) that were not considered in the phenotyping protocols. As this approach is based on model species, potential discrepancies between the model species and humans should be considered. Another issue is that this approach knocks out genes in the whole body rather than silencing them in a tissue-specific manner, whereas in realistic situations, several relevant organs can play individual roles via specific biological functions in developing metabolic disorders. Thus, using genes from KO-based phenotyping studies requires an understanding of these pros and cons and of their relationships with human disease mechanisms.

Human gene expression profiling of obesity and T2D

A metabolic disorder is a condition in which the dynamic status of in vivo metabolism falls into disorder throughout the body (for example, insulin-resistant state of T2D). Thus, developing effective therapeutic approaches can require an understanding of the exact dynamic states of metabolic systems within the body of individual patients. This understanding of exact dynamical states of in vivo metabolic systems can require the following considerations. First, comprehensive molecular profiling is necessary to form broad multi-omic observations, including gene expression, protein expression, and metabolic profiles. Second, this comprehensive molecular profiling needs to be conducted on various relevant organs, such as adipose, liver, and muscle to study insulin resistance. However, gene expression profiling is the only relatively popular approach for high-throughput molecular profiling due to its advantages of higher reliability and lower costs than the other techniques. There are also certain challenges in acquiring the human tissue samples needed for molecular profiling as surgical treatment is not a general treatment for obesity or T2D. For these reasons, few studies are currently available that have conducted comprehensive molecular profiling in various relevant organs, even when only gene expression is considered.

Nevertheless, some studies have conducted gene expression profiling in specific organs under certain conditions of interest. As with GWAS of various phenotypes, appropriate integration of these data sets can enable an assessment that mimics comprehensive multi-organ profiling. To integrate multiple gene expression profiles from independent studies, normalization across data sets is required to achieve data-level coherency. Ideally, all data sets to be normalized together would be generated on the same platform; however, gene expression profiling has been performed with various microarray and next-generation sequencing platforms. Among the many platforms, the most popular one, with the largest number of studies, is the Affymetrix GeneChip Human Genome U133 Plus 2.0 microarray, despite recent advancements in next-generation sequencing platforms. Table 3 lists the studies on obesity or T2D with available gene expression profiles based on the Affymetrix GeneChip Human Genome U133 Plus 2.0 microarray. Most studies profiled samples of only one tissue, except for two data sets (GSE13070 and GSE41168). The study designs vary: profiling disease samples only, comparing disease profiles with normal control profiles, comparing profiles across different stages of disease, comparing profiles before and after certain interventions, and comparing profiles from siblings or twins to reduce the effect of genetic background. From this collection of expression profiles of various conditions generated on the same profiling platform (as listed in Table 3), profiles from multiple studies whose subject conditions match the conditions of interest can be integrated into a single normalized data set.

Table 3 Gene expression data sets studying obesity or T2D generated with Affymetrix GeneChip Human Genome U133 Plus 2.0 microarray

As a simple example of integrating gene expression profiles from several studies with subjects of interest, differentially expressed genes (DEGs) between lean healthy subjects and obese healthy or obese diabetic subjects were identified in a tissue-specific way. Of the 20 data sets listed in Table 3, 17 (all except E-TABM-325, GSE27916, and E-MTAB-1895) provide BMI information and metabolic profiles or insulin resistance/sensitivity information. A total of 602 gene expression profiles of adipose, liver, and muscle samples from the 17 studies were integrated into a single data set, where the lean/obese condition of each sample was determined from BMI and the healthy/diabetic condition was determined from the metabolic profiles and insulin resistance/sensitivity information. For each tissue type, a gene was declared a DEG if it showed more than a 1.5-fold change in expression with an FDR-adjusted p-value < 1E-6 (t-test) between lean healthy samples and obese/diabetic samples. Fig. 3 shows the Venn diagram of the 2334 DEGs identified from the three tissue types. Because gene expression is tissue-specific, many DEGs are differentially expressed in only one tissue. For example, PPARG, a regulator of adipocyte differentiation, is an adipose-specific DEG. There are 34 common DEGs that show differential expression in all three tissue types. Five of these 34 DEGs are known to be related to metabolism or mitochondria. FAHD1 is a mitochondrial enzyme related to tyrosine metabolism [24], and THRSP is related to the regulation of lipid metabolism and lipogenesis [25]. DNAJC15 [26] is a negative regulator of the mitochondrial respiratory chain that prevents mitochondrial hyperpolarization and restricts mitochondrial generation of ATP; MRPS10 is a mitochondrial ribosomal protein; and LIAS is localized in mitochondria and is known to be associated with hyperglycinemia [27].
Note that these are DEGs common to all tissue types, so their relevance to mitochondria and metabolism may not be tissue-specific. Compared to DNA-level genetic variants, which have relatively small effect sizes, DEGs with significant expression changes in phenotypes of interest can represent the biological mechanisms that drive such phenotypes more directly, because these expression changes are a snapshot of the current biological dynamic status. Thus, searching for therapeutic targets based on gene expression profiles may provide a higher chance of identifying points of intervention than searching solely based on DNA-level susceptibility variants. However, gene expression profiles measure transcription and therefore have their own limitations. First, there can be discrepancies between transcription-level activities and protein levels or metabolic activity levels, as there are many post-transcriptional regulatory mechanisms, such as small RNA activities. Second, identifying the key driver events of these transcriptional changes is still a challenge. Nevertheless, publicly available gene expression profiles from relevant studies of obesity and T2D are important and beneficial resources, as they provide unique information on dynamic gene regulation that cannot be inferred from DNA-level phenotype associations.
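
The DEG criterion described above (more than a 1.5-fold change with an FDR-adjusted p-value < 1E-6) can be sketched as follows. This sketch makes several assumptions that are not stated in the text: the per-gene t-test p-values are computed elsewhere and passed in, fold changes are given on a log2 scale, changes in either direction count, and the FDR adjustment is the Benjamini-Hochberg procedure. The gene labels and numbers are invented.

```python
import math

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted

def call_degs(genes, log2_fold_changes, p_values, fold=1.5, alpha=1e-6):
    """Select genes passing both the fold-change and the adjusted p-value cutoffs."""
    adjusted = benjamini_hochberg(p_values)
    threshold = math.log2(fold)
    return [
        gene
        for gene, lfc, q in zip(genes, log2_fold_changes, adjusted)
        if abs(lfc) > threshold and q < alpha
    ]

# Invented per-gene statistics for one tissue type.
genes = ["G1", "G2", "G3", "G4"]
log2_fc = [1.2, -0.9, 0.3, 0.7]
p_vals = [1e-9, 2e-8, 1e-12, 0.04]
print(call_degs(genes, log2_fc, p_vals))
```

In this toy input, G3 fails the fold-change cutoff despite its small p-value and G4 fails the adjusted p-value cutoff despite its fold change, which shows that both conditions must hold. Running the same function separately on each tissue would give the per-tissue DEG lists that are compared in the Venn diagram.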

Fig. 3 Tissue-specific DEGs between the lean healthy group and obese/diabetic group

Comparing biological coverage of GWAS, KO-based phenotyping, and gene expression profiles

To compare the coverage of obesity/T2D-related genes that can be identified from currently available GWAS, KO-based phenotyping, and gene expression profile data, the genes identified from the different data types were compared with one another. Fig. 4 shows the Venn diagram of the obesity/T2D-related genes that were identified from each data type in the previous sections and the amount of overlap between them. The identified genes show very little overlap between data types, and DEGs from gene expression profiles show significantly low overlap with the other two data types (p-value of low overlap: DEG–GWAS = 7.73E-17, DEG–IMPC = 0.026). The overlap between the genes identified from GWAS and the KO phenotyping study is also very low, but its statistical significance is not as strong as in the other cases. This low commonality between the obesity/T2D-related genes from different data types suggests that their different approaches to assessing the relationships between genes and phenotypes bias the coverage of identified genes. The discrepancy is clearer between the DEGs from gene expression profiles and the genes from the other two data types, suggesting that gene expression-level changes and DNA-level genetic effects may cover different biological aspects. This difference in coverage between gene expression profiles and DNA-level genetics becomes more evident when their enriched biological functions are compared with one another. For each list of genes identified from gene expression profiles, GWAS, and KO-based phenotyping, the statistical enrichment of known biological functions was evaluated to identify the most strongly relevant biological functions for that list. The Molecular Signatures Database [28, 29] is a collection of annotated gene sets, in which 17,774 gene sets are curated, each with a related list of genes (Molecular Signatures Database v5.2).
Among these, each of the 6659 gene sets that represent known biological pathways (curated from pathway databases, such as KEGG [30] and REACTOME [31, 32]) and Gene Ontology [33, 34] biological processes and molecular functions was evaluated for its overlap with each list of genes identified from gene expression profiles, GWAS, and KO-based phenotyping, and the statistical significance of the overlap was computed as a hypergeometric p-value. For the list of genes from each data type, biological functions with an FDR-adjusted p-value < 1E-10 were declared the most strongly relevant functions, and Fig. 5a shows the Venn diagram of the most strongly relevant biological functions for the three data types. Of the biological functions that are very strongly enriched in the genes that showed obesity/T2D-related phenotypes in KO-based phenotyping (IMPC), all but one were also discovered from other data types: 37 were discovered by both GWAS and gene expression profile-based analysis, and three more were also discovered by gene expression profile-based analysis. The biological functions from gene expression profile-based studies show large discrepancies with those from GWAS, which strongly implies differences in the biological coverage of gene expression profiles and DNA-level genetic susceptibility information. Fig. 5b shows the very strongly enriched biological functions for the different data types that are relevant to obesity/T2D, and it highlights biological mechanisms that are specifically enriched in DEGs from gene expression profiles. As Fig. 5b shows, the lists of genes from gene expression profiles, GWAS, and KO-based phenotyping share strongly enriched biological functions that are related to metabolism, differentiation, homeostasis, and lipids.
However, biological functions related to muscle, immunity, catabolism, cytokines, epigenetic modification, and inflammation are, in general, specifically enriched in the genes from gene expression profiles. This finding implies that genes involved in such biological functions are more affected by dynamic gene expression changes than by static genetic backgrounds. These results emphasize that searches for therapeutic targets and strategies need to consider the discrepancies in gene coverage and biological functions that arise between data types.
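
Both kinds of hypergeometric p-value used in this section, the lower tail for unexpectedly low overlap between two gene lists and the upper tail for enrichment of a gene set in a gene list, can be sketched with the standard library alone. The size of the background gene universe is not stated above, so it appears here as an explicit assumption, and all numbers are invented.

```python
from math import comb

def hypergeom_pmf(k, population, set_a, set_b):
    """Probability that two random gene sets of sizes set_a and set_b share exactly k genes."""
    return comb(set_a, k) * comb(population - set_a, set_b - k) / comb(population, set_b)

def enrichment_p(overlap, population, set_a, set_b):
    """Upper tail P(X >= overlap): the overlap is larger than expected by chance."""
    upper = min(set_a, set_b)
    return sum(hypergeom_pmf(k, population, set_a, set_b) for k in range(overlap, upper + 1))

def depletion_p(overlap, population, set_a, set_b):
    """Lower tail P(X <= overlap): the overlap is smaller than expected by chance."""
    lower = max(0, set_a + set_b - population)
    return sum(hypergeom_pmf(k, population, set_a, set_b) for k in range(lower, overlap + 1))

# Invented example: a universe of 1000 genes, lists of 100 and 80 genes, 2 genes shared.
universe, list_a, list_b, shared = 1000, 100, 80, 2
print("expected overlap by chance:", list_a * list_b / universe)
print("p-value for low overlap:", depletion_p(shared, universe, list_a, list_b))
print("p-value for enrichment:", enrichment_p(shared, universe, list_a, list_b))
```

For an enrichment analysis across thousands of gene sets, the upper-tail p-values would then be adjusted for multiple testing before a threshold such as 1E-10 is applied.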

Fig. 4 Genes that were identified from each data type and their overlaps. Hypergeometric p-values for lower-than-expected overlap are given for each overlap

Fig. 5 Biological functions that are very strongly enriched (FDR-adjusted p-value < 1E-10) in the list of obesity/T2D-related genes from each data type. a The Venn diagram of the very strongly enriched biological functions. b Biological functions that are very strongly enriched for at least one data type and are relevant to obesity/T2D. For each biological function, its related functional categories are also presented

Conclusion

Many efforts have been made to understand obesity and T2D and to find their therapeutic targets. However, few data resources exist with comprehensive high-throughput molecular profiles for obesity or T2D, even though such comprehensive molecular information is essential for understanding these conditions. In this review, publicly available genomic data resources for obesity and T2D are discussed, covering major GWAS, a KO-based phenotyping study, and studies with gene expression profiles based on a popular microarray platform. While no comprehensive data resource is available, systematic integration of these individual data sources based on their associated phenotypes and experimental conditions gives us a chance to mimic comprehensive collections of genomic data. GWAS and the KO-based phenotyping study provide insights into the function of individual genes, whereas gene expression profiles provide complementary opportunities to observe dynamic, system-level changes of biological functions that cannot be observed with DNA-level information. A comparison of the obesity/T2D-associated genes identified from different data types showed different coverage of identifiable genes, and a comparison of their enriched biological functions provided stronger clues to the biological discrepancies between data types. Thus, using these data resources in one's own studies with specific disease models requires consideration of such discrepancies in data characteristics and coverage.

From this point of view, a desirable approach to building a comprehensive molecular profile for obesity or T2D requires consideration of the following. First, the cohort must be recruited broadly enough to represent a wide range of metabolic conditions, because conditions such as obesity or T2D develop continuously through varying states of metabolic dynamics. Second, a comprehensive collection of phenotypes must be monitored to precisely model the progression status of metabolic conditions. Third, tissue samples from relevant organs must be collected from individuals in the cohort, as several organs participate in the development of metabolic conditions. Lastly, efforts should be made to keep the molecular profiles of tissue samples as comprehensive as possible by covering various levels of molecular mechanisms, including information at the DNA, transcript or gene expression, epigenetic, protein, and metabolic profile levels. Such comprehensive molecular profiling of multiple human organs (if possible), or even of organs from model species, will give us information on molecular activities in obesity and T2D with an unparalleled level of resolution, and this rich information will become a solid basis for searching for therapeutic targets and developing treatment strategies.