Abstract
Treating messenger RNA transcript abundances as quantitative traits and mapping gene expression quantitative trait loci for these traits has been pursued in gene-specific ways. Transcript abundances often serve as a surrogate for classical quantitative traits in that the levels of expression are significantly correlated with the classical traits across members of a segregating population. The correlation structure between transcript abundances and classical traits has been used to identify susceptibility loci for complex diseases such as diabetes1 and allergic asthma2. One study recently completed the first comprehensive dissection of transcriptional regulation in budding yeast3, giving a detailed glimpse of a genome-wide survey of the genetics of gene expression. Unlike classical quantitative traits, which often represent gross clinical measurements that may be far removed from the biological processes giving rise to them, the genetic linkages associated with transcript abundance afford a closer look at cellular biochemical processes. Here we describe comprehensive genetic screens of mouse, plant and human transcriptomes by considering gene expression values as quantitative traits. We identify a gene expression pattern strongly associated with obesity in a murine cross, and observe two distinct obesity subtypes. Furthermore, we find that these obesity subtypes are under the control of different loci.
Liver tissues from 111 F2 mice constructed from two standard inbred strains, C57BL/6J and DBA/2J, were profiled using a mouse gene oligonucleotide microarray. The expression values from these experiments were treated as quantitative traits and carried through a linkage analysis using evenly spaced markers across the autosomal chromosomes. Of the 23,574 genes represented on the microarray, 7,861 were detected as significantly differentially expressed (type I error = 0.05) in the parental strains or in at least 10% of the F2 mice profiled. Using standard interval mapping techniques, quantitative trait loci (QTL) with log of the odds ratio (LOD) scores greater than 4.3 (P-value < 0.00005) were identified for 2,123 genes, with a maximum LOD score of 80.0 (P-value < 10^-20). On average, gene expression QTL (eQTL) with LOD scores greater than 4.3 explained 25% of the transcription variation of the corresponding genes observed in this F2 set, with this percentage increasing to nearly 50% for LOD scores greater than 7.0. Applying a simple Bonferroni correction to a 0.05 genome-wide significance level for each trait, given that 7,861 traits were tested, we would expect only 393 false positives at a LOD score threshold of 4.3.
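The correspondence between the LOD thresholds and P-values quoted above, and the expected number of false positives, can be checked with a short calculation. The sketch below assumes a pointwise likelihood-ratio test with 2 degrees of freedom (additive and dominance effects in an F2), which is an assumption and not something stated in the text; under it the P-value reduces to 10^-LOD. The function names are illustrative only.

```python
import math

def lod_to_pvalue_2df(lod):
    """Pointwise P-value for a LOD score, ASSUMING the likelihood-ratio
    statistic 2*ln(10)*LOD follows a chi-square with 2 degrees of freedom.
    For 2 df the chi-square survival function is exp(-x/2), i.e. 10**(-LOD)."""
    chi_square = 2.0 * math.log(10.0) * lod
    return math.exp(-chi_square / 2.0)

def expected_false_positives(n_traits, genome_wide_alpha):
    """Expected number of traits showing a false-positive linkage when each
    trait is tested at the given genome-wide significance level."""
    return n_traits * genome_wide_alpha

print(lod_to_pvalue_2df(4.3))                 # about 5e-5
print(lod_to_pvalue_2df(7.0))                 # about 1e-7
print(expected_false_positives(7861, 0.05))   # about 393
```

Under this assumption, a LOD of 4.3 corresponds to a pointwise P-value of roughly 0.00005, in line with the threshold used in the text.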
Without filtering based on significant differential expression over the set of mice profiled, we detected 4,339 eQTL over 3,701 genes with LOD scores greater than 4.3, 11,021 genes with at least one eQTL with a LOD score greater than 3.0, and 17,415 eQTL with LOD scores over 2.0. The number of eQTL with LOD scores exceeding 7.0 (P-value < 10^-7) increased by 50% when genes that were not detected as significantly differentially regulated, as defined above, were considered. This result indicates that although individual tests of hypotheses on the differential regulation of a single gene may not be significant, viewing the behaviour of that gene by genotype over 111 animals provides significantly more information on the biological activity of that gene, an observation also noted for yeast3. Of the 965 genes with LOD scores greater than 7.0, 157 had a maximum fold change of 3.0, indicating a class of genes whose high LOD scores reflect tight transcriptional control (small variance), not large expression differences.
Figure 1a plots the percentage of eQTL at different LOD score thresholds across 920 evenly spaced bins, each 2 cM wide, covering the mouse genome. The number of eQTL in each bin was divided by the total number of eQTL and plotted. eQTL hotspots are apparent on chromosomes 2, 6, 7, 9, 10, 16 and 17, where for each of these hotspot locations, greater than 1% of the total number of eQTL identified genome-wide localize to a 4-cM window. With 460 4-cM windows over the 19 autosomal chromosomes, the probability that greater than 1% of the eQTL would localize to one such window is less than 1.0 × 10^-16. Furthermore, gene expression is seen to be a complex trait, given that 40% of genes with at least one eQTL with a LOD score greater than 3.0 had more than one eQTL, and close to 4% of such genes had more than three eQTL. It is the clustering of eQTL to specific loci and the relationship between these genes with respect to expression, rather than single eQTL significance levels, that when taken together lead to highly significant and interesting patterns that can be associated with phenotypes related to common diseases (see below).
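The hotspot reasoning above (the fraction of eQTL per 4-cM window, and the chance that one window captures more than 1% of them) can be sketched as follows. This is a minimal illustration, assuming eQTL positions are given as cumulative genome-wide cM coordinates and that, under the null, each eQTL falls into any of the 460 windows with equal probability; neither assumption is spelled out in the text, and the helper names are hypothetical.

```python
import math
from collections import Counter

def window_fractions(positions_cm, window_cm=4.0):
    """Fraction of all eQTL falling in each window of width window_cm.
    positions_cm are ASSUMED to be cumulative genome-wide cM coordinates."""
    counts = Counter(int(pos // window_cm) for pos in positions_cm)
    total = len(positions_cm)
    return {window: n / total for window, n in counts.items()}

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(p) + (n - i) * math.log1p(-p)
        for i in range(k, n + 1)
    ]
    biggest = max(log_terms)
    return math.exp(biggest) * sum(math.exp(t - biggest) for t in log_terms)

# Toy positions: two eQTL in window 0, one each in windows 1 and 2.
print(window_fractions([0.5, 1.0, 5.0, 9.0]))

# Chance that a given window holds more than 1% of 4,339 eQTL (at least 44)
# when each eQTL lands in one of 460 windows uniformly at random.
print(binom_tail(4339, 44, 1.0 / 460.0))
```

The uniform-window null is the simplest possible choice; a null that accounts for gene density along the chromosomes would give different tail probabilities.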
Figure 1: Murine gene expression quantitative trait loci (eQTL) distributions and the molecular basis for fat pad mass (FPM) in a murine F2 cross.

a, Percentage of eQTL in 2-cM bins spanning the murine autosomal chromosomes at two LOD score thresholds. b, Colour matrix display for hierarchically clustered genes (x axis) and extreme FPM mice (y axis). Dark/light blue bars indicate mice in the upper/lower half of the high FPM group; dark/light orange bars indicate mice in the lower/upper half of the low FPM group. See text for a definition of the five arrows. c, d, Chromosome 2 (c) and 19 (d) log of the odds ratio (LOD) score curves for the FPM trait. QTL analysis was performed on three different sets: (1) black curves, all F2 mice; (2) blue curves, F2 mice classified as high FPM group 1 and low FPM group; (3) red curves, F2 mice classified as high FPM group 2 and low FPM group.
Of the 18,460 genes that could be mapped to a chromosome position using the Celera Mouse Genome Database, 3,007 had eQTL with LOD scores greater than 4.3, and 784 had eQTL with LOD scores greater than 7.0. Approximately 34% of the mapped genes with eQTL exceeding 4.3 had a physical location coincident with the eQTL position, whereas 71% of the mapped genes with eQTL exceeding 7.0 had a physical location coincident with their eQTL position. The trend observed here is that eQTL with high LOD scores are cis-acting in most cases, whereas moderately significant eQTL are trans-acting in most cases. This is consistent with our expectation that first-order effects (DNA variations in a gene that affect transcription of the gene itself) are easier to detect than second-order effects (genes acting on the transcription of other genes).
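The cis/trans distinction drawn above amounts to asking whether an eQTL peak coincides with the physical location of the gene it controls. A minimal sketch follows; the 10-cM tolerance, the assumption that gene positions have already been converted to cM, and the example records are all hypothetical and not taken from the study.

```python
def classify_eqtl(gene_chrom, gene_pos_cm, eqtl_chrom, eqtl_pos_cm, tol_cm=10.0):
    """Label an eQTL 'cis' if it lies on the gene's chromosome within tol_cm
    of the gene, otherwise 'trans'. tol_cm is an ASSUMED tolerance."""
    same_chrom = gene_chrom == eqtl_chrom
    if same_chrom and abs(gene_pos_cm - eqtl_pos_cm) <= tol_cm:
        return "cis"
    return "trans"

def fraction_cis(records, tol_cm=10.0):
    """Fraction of (gene_chrom, gene_pos, eqtl_chrom, eqtl_pos) records
    that are classified as cis-acting."""
    labels = [classify_eqtl(*record, tol_cm=tol_cm) for record in records]
    return labels.count("cis") / len(labels)

# Hypothetical records: gene chromosome/position, eQTL chromosome/position.
records = [
    (2, 20.0, 2, 22.0),   # same chromosome, 2 cM apart
    (6, 5.0, 6, 60.0),    # same chromosome, far apart
    (9, 30.0, 19, 30.0),  # different chromosomes
    (9, 48.0, 9, 50.0),   # same chromosome, 2 cM apart
]
print(fraction_cis(records))
```

Applying such a function separately to eQTL above each LOD threshold would yield percentages comparable in kind to the 34% and 71% figures quoted in the text.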
Figure 2 highlights a range of interesting gene-centred polymorphisms known to exist between the DBA and B6 strains (cis-acting transcription regulation). The C5 gene (NM_010406) has a 2-base-pair (bp) deletion in the coding region in DBA mice that leads to rapid decay of the transcript2, compared with B6. A LOD of 27.4 centred over the C5 gene on chromosome 2 is readily detected (black curve). The Alad (AK002300) gene is present in two copies in DBA and one copy in B6 (ref. 4). A LOD of 9.3 centred over the Alad gene (red curve) represents the differential dosing that occurs between the two strains. The St7 (NM_022332) gene is differentially spliced at several locations5, and, for a stable splice form at the 3' location of the gene, the probe for this gene fortuitously overlapped the region alternatively spliced out in DBA, but not in B6. The differential splicing event is detected by the major QTL (LOD score of 20.1) for St7, which is centred over the St7 gene (blue curve) on chromosome 6. Finally, the Nnmt (NM_010924) gene, important for drug metabolism, has been identified as being polymorphic with respect to transcription between the DBA and B6 strains6. This polymorphism is confirmed by a major QTL (LOD score of 15.3) for Nnmt, centred over the Nnmt gene (green curve) on chromosome 9. We note that single nucleotide polymorphisms (SNPs) covered by 60-residue oligonucleotide probes would not be expected to significantly affect transcript abundance measurements among these samples7.
Figure 2: Cis-acting eQTL identified for several DNA polymorphisms known to induce transcription polymorphisms.

The LOD score curves over the murine autosomal chromosomes are shown for the expression traits of four genes: (1) black curve, C5 (NM_010406); (2) red curve, Alad (AK002300); (3) blue curve, St7 (NM_022332); and (4) green curve, Nnmt (NM_010924). The alternating green and blue strip at the base of the curves represents chromosome boundaries, starting with chromosome 1 and ending with chromosome 19.
Next we investigated a way of combining gene expression, genetics and clinical trait data to elucidate the complexity of common diseases. Major loci controlling complex phenotypes such as obesity may potentially affect scores of genes, if not hundreds. We would expect those genes involved in the more downstream aspects of pathways associated with common diseases to have eQTL linked to the major causative loci for those diseases. In addition, there may be heterogeneity among the causative loci for a given disease in a population of interest. When present, this heterogeneity impacts the ability to detect linkages to the causative loci, as the significance of any one locus is diminished when the population is considered as a whole.
Because the mice described above had been on a high-fat, atherogenic diet for 4 months before livers were profiled8, they model the spectrum of obesity, diabetes and atherosclerosis in a natural population. The 280 genes depicted in the two-dimensional cluster in Fig. 1b were selected as the most differentially expressed set of genes in mice comprising the upper and lower 25th percentiles of the subcutaneous fat-pad-mass (FPM) trait. When clustering on this set of genes over the high/low FPM mice, the mice cluster almost perfectly into high FPM and low FPM groups (Fig. 1b). Furthermore, there seem to be two distinct expression patterns for mice in the high FPM group, indicating some degree of heterogeneity in the high FPM mice.
Arrows in Fig. 1b highlight five regions in Fig. 1a where more than 50% of the genes in the FPM set genetically link. Taking into account the eQTL distribution for all genes highlighted in Fig. 1a, each of the five hotspot regions defined by the FPM cluster is very significantly enriched for eQTL for the genes in the FPM set. For instance, 25% of the genes in the FPM set have eQTL in a 10-cM region on chromosome 19. This represents an almost fivefold increase over what would be expected by chance, given that roughly 5.5% of genes from the genome-wide gene set have eQTL to this same 10-cM region.
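The fold increase quoted for the chromosome 19 region is the ratio of the two percentages. The arithmetic can be written out as below; the function name is illustrative.

```python
def fold_enrichment(fraction_in_set, fraction_in_background):
    """Ratio of the fraction of set genes with an eQTL in a region to the
    fraction of background (genome-wide) genes with an eQTL in that region."""
    return fraction_in_set / fraction_in_background

# 25% of FPM-set genes versus roughly 5.5% of the genome-wide gene set.
print(round(fold_enrichment(0.25, 0.055), 2))  # 4.55
```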
The patterns in Fig. 1b serve to define the obesity trait, FPM, beyond what would be possible without the expression data. There are clearly two distinct patterns associated with high FPM mice. For the FPM trait, a genome-wide scan revealed four classical QTL (clinical QTL or cQTL) with LOD scores greater than 2.0 (ref. 8). To further elucidate this clinical trait, the 111 F2 animals for which clinical and gene expression data existed were classified into one of the three groups shown in Fig. 1b. Subsequently, separate genetic analyses were performed on two sets of animals: (1) those classified as high FPM group 1 or low FPM; and (2) those classified as high FPM group 2 or low FPM. Figure 1c, d depicts the results of these analyses for two chromosomes. The chromosome 2 FPM QTL was the largest of the four QTL originally identified for this trait, when all animals were considered together. This QTL vanishes when considering the high FPM group 1 with the low FPM group and increases by almost 2 LOD units over the original when considering the high FPM group 2 with the low FPM group. Figure 1d depicts another interesting locus for which the original analysis on the full set of mice yielded no significant QTL for the FPM trait, but where the high FPM group 2 considered with the low FPM group gave rise to a QTL with a significant LOD score, whereas the high FPM group 1 considered with the low FPM group was less significant than that of the full set. These results indicate that the chromosome 2 and 19 QTL each significantly affect only a subset of the F2 population, a form of heterogeneity that clearly demonstrates the complexity underlying traits such as obesity.
We next focused our efforts on objectively identifying candidate genes for common diseases. An expanded view of the clinical traits and a portion of the gene expression traits linking to the chromosome 2 locus discussed above is given in Fig. 3. Notably, a group of major urinary protein (Mup) genes represented in the Fig. 1b FPM cluster is linked to the chromosome 2 locus, in addition to seven other loci (all with a LOD score exceeding 2.0), four of which localize together with adiposity or FPM traits. The Mup1 gene stands out because it was the gene most highly correlated with, or had eQTL co-localized with eQTL from, many other genes known to be involved in obesity-related pathways, including retinoid X receptor (Rxr), Rxr-interacting protein, acyl-coenzyme A oxidase 1, leptin receptor, peroxisome proliferator activated receptor (Ppar) and Lpr6. This demonstrates that the chromosome 2 locus draws together adiposity, FPM, cholesterol and triglyceride levels and is linked to genes with proven roles in obesity and diabetes. Furthermore, the Mup genes are members of the lipocalin protein family, and although they are known to have a central role in pheromone-binding processes that affect mouse physiology and behaviour9, variations in Mup expression have been associated with variations in body weight10, bone length10 and levels of very-low-density lipoprotein11.
Figure 3: Clinical QTL (cQTL) for obesity-related traits localize together with eQTL.

LOD score curves for four obesity-related traits and for their joint analysis are shown: (1) blue curve, subcutaneous FPM; (2) green curve, perimetrial FPM; (3) red curve, omental FPM; (4) orange curve, adiposity; (5) thin black curve, joint LOD score curve for these four clinical traits considered simultaneously. LOD score curves for four candidate genes that may explain the obesity cQTL are also shown (thick black curves).
Most of the genes linked to the chromosome 2 locus do not physically reside on chromosome 2, and so are at least partially regulated by one or more loci in the chromosome 2 hotspot region. Of the 423 genes linked to the chromosome 2 locus for which we have chromosome mapping information, there are only four eQTL with LOD scores greater than 3.0 that correspond to genes whose physical locations are within 2 cM of the peak. However, only two of the four genes (NM_025575 and NM_015731) had significant genetic interactions with the subcutaneous FPM trait. Gene NM_025575 codes for a dolichyl-diphospho-oligosaccharide-protein glycosyltransferase and gene NM_015731 codes for a cation-transporting ATPase; these genes may be considered as primary causative candidates for the linkage activity at the chromosome 2 locus.
The region supporting the chromosome 2 locus is homologous to human chromosome 20q12-q13.12, a region that has previously been linked to human obesity-related phenotypes12, 13. The human orthologues for genes NM_025575 and NM_015731 highlighted in Fig. 3 reside in the human chromosome 20 region. Although other genes, such as melanocortin 3 receptor (Mc3r), have been suggested as possible candidates for obesity at this locus13, our data suggest that the genes NM_025575 and NM_015731 may be responsible for the underlying QTL, as discussed above. We note that expression levels of Mc3r are not linked to the chromosome 2 locus, and there were no SNPs annotated in the exons or introns of this gene between the C57BL/6J and DBA/2J strains in the most recent build of the Celera RefSNP database, further suggesting that Mc3r may not be the chromosome 2 QTL.
To pursue further our survey of the heritability of gene expression, we focused on Zea mays, a classical genetic organism that has been studied intensely and where previous reports have demonstrated that protein14, enzymatic15 and metabolite16 levels could be associated with regions of the genome that control complex traits. QTL results from treating gene expression in ear leaf tissue as a quantitative trait in 76 F2-derived F3 progeny constructed from two typical inbred lines of maize—a stiff stalk synthetic type and a Lancaster type—largely parallel those in the mouse described above. Of the 18,805 genes detected as being significantly differentially regulated (type I error = 0.05) in at least 10 samples, there were 6,481 genes with at least one eQTL exceeding a LOD score of 3.0 (with a maximum LOD score of 41.3), and a total of 7,322 QTL overall. Most genes had a single eQTL, and just over 80% of the eQTL exceeding a LOD score of 7.0 were localized together with the physical gene giving rise to the eQTL, when the physical location of the gene could be determined.
Figure 4 represents a new form of genetic interaction. Two genes, each with a significant eQTL on different chromosomes, are seen to have uncorrelated expression values over the 76 samples. However, the genes appear to be interacting by genotype, representing an interaction between genes that is similar to epistasis, but more general because it occurs between the eQTL of different genes. Although interactions such as this (statistically significant at a type I error of 0.05) occurred among fewer than 10% of the genes with significant eQTL (LOD scores greater than 3.0), this form of genetic interaction clearly offers a potentially powerful means to implicate genes as being involved in the same or related pathways. Furthermore, this type of interaction would have been missed by all standard clustering methods applied to the expression data alone, but the relationship was readily detected using statistical genetics models designed to identify interactions among QTL.
Figure 4: Genes with no overall correlation with respect to expression demonstrate interesting patterns of genetic interaction.

The scatter plot shows the mean log10 ratio for two Zea mays genes that are uncorrelated overall, each with a significant eQTL (LOD of 24.3 for the gene on the x axis and 24.9 for the gene on the y axis) falling on two separate chromosomes. Patterns are apparent in the plot despite the overall random correlation, as the four groups in each quadrant of the plot are correlated. The least squares regression line is shown for each quadrant, with the correlation coefficient values and corresponding P-values given in parentheses. EST, expressed sequence tag. HC, Helminthosporium carbonum.
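The pattern described for Fig. 4, in which two genes are uncorrelated overall yet correlated within groups, can be illustrated with a small calculation. The sketch below assumes that samples are grouped by their joint genotype class at the two eQTL; that grouping, the toy values and the helper names are assumptions made for illustration and are not the statistical genetics model used in the study.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_by_group(xs, ys, groups):
    """Pearson correlation computed separately within each group label
    (here, an ASSUMED joint genotype class at the two eQTL)."""
    buckets = defaultdict(lambda: ([], []))
    for x, y, group in zip(xs, ys, groups):
        buckets[group][0].append(x)
        buckets[group][1].append(y)
    return {group: pearson(bx, by) for group, (bx, by) in buckets.items()}

# Toy data: positive relationship in class "A", negative in class "B".
gene_x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
gene_y = [1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
classes = ["A", "A", "A", "B", "B", "B"]

print(pearson(gene_x, gene_y))                        # 0.0 overall
print(correlation_by_group(gene_x, gene_y, classes))  # 1.0 in A, -1.0 in B
```

Clustering on expression values alone would see only the overall correlation of zero, which is why a relationship of this kind goes undetected without the genotype information.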
As a preliminary survey into the genetics of gene expression in humans, 56 individuals from four CEPH (Centre d'Etude du Polymorphisme Humain) reference families17 were selected for expression profiling of lymphoblastoid cell lines using a standard human gene oligonucleotide microarray18. The four families, CEPH/Utah pedigrees 1362, 1375, 1377 and 1408, consisted of large 'sibships' along with parents and grandparents. Heritability analysis was performed for gene expression on a subset of 2,726 genes that were significantly differentially regulated (type I error = 0.05) within 8 or more of the 16 pedigree founders. We deemed the sample size too small to perform systematic linkage analysis across all genes. We found that for the differentially expressed genes, 29% had a detectable genetic component (type I error < 0.05). This result offers an important glimpse into the genetics of gene expression in humans, with such a large percentage of genes detected with significant heritabilities in such a small sample of 'normal' individuals. We propose that this group of genes may make good therapeutic targets for complex human diseases, given that their degree of genetic control was so readily identifiable in such a small number of families. A complete list of those genes with a significant heritable component and additional information on tests for gender and age effects are provided in the Supplementary Information.
The identification of eQTL for genes expressed in a handful of tissues begins to provide an insight into the genetic networks that underlie the complexity of living systems. The several hundred genes with LOD scores exceeding 20 in the mouse represent a new class of quantitative traits, with linkage significances not commonly seen before in mammalian systems. The causal nature of genetics allows the anchoring of multiple genes under the common control of single or multiple loci, as shown in Fig. 3, thereby providing roots for the graphs that can more completely depict the complicated network of gene interactions at work in complex phenotypes.
The class of genes discussed for Figs 1b and 3 provides objective evidence that many of the genes co-localized to a single QTL hotspot are associated with the obesity-related traits. The patterns of expression serve to refine the obesity phenotype and allow the enrichment of subpopulations that are homogeneous with respect to the underlying causes of obesity in the population. Identifying such subpopulations has significant consequences for drug discovery, as each subpopulation may be more effectively treated by a compound that targets a pathway specifically associated with the disease in that subpopulation.
Combining gene expression, genotype and clinical data in a segregating population has the potential to affect the more significant rate-limiting steps in the drug discovery process: objectively classifying individuals according to disease subtypes and identifying the drivers of the pathways or the causal factors underlying those disease subtypes. In the past, dissecting complex traits using genetics has met with limited success, and until now gene expression has served as an indirect marker for complex traits. We have demonstrated that the combination of gene expression and genetics data has the potential to overcome these barriers. The addition of gene expression data can be used to refine the disease phenotype, directly implicate pathways and genes comprising those pathways associated with the disease phenotype, and identify the key drivers of the pathways underlying the disease phenotype.
Methods
Preparation of labelled complementary DNA
Lymphoblastoid cell lines from CEPH/Utah pedigree families 1362, 1375, 1377 and 1408 were obtained from Coriell Cell Repositories. Other lymphoblastoid cell lines were established from normal donors by immortalization with Epstein–Barr virus (EBV) as described previously19. Cells were cultured in RPMI 1640 medium containing 15% fetal bovine serum, and penicillin/streptomycin antibiotics (Invitrogen). Cells were maintained in the log phase of cell growth for at least two days and were collected at densities of 0.4–0.9 × 10^6 cells ml^-1.
An F2 intercross was constructed from C57BL/6J and DBA/2J strains of mice. All mice were housed under conditions meeting the guidelines of the Association for Accreditation of Laboratory Animal Care. Mice were on a rodent chow diet up to 12 months of age, and then switched to an atherogenic high-fat, high-cholesterol diet for another 4 months. More details on this cross are described in ref. 8. Parental and F2 mice were killed at 16 months of age. At death, the livers were immediately removed, flash-frozen in liquid nitrogen and stored at -80 °C.
An F3 cross was constructed from standard inbred lines of Z. mays: a stiff stalk strain and Lancaster type20. After the initial cross, the F1 plants were self-crossed to obtain the F2 progeny, and then the F2 were self-crossed to obtain the F3 progeny. The plants were field grown, and at 5 days after flowering, ear leaf tissues were collected from ten F3 progeny for each F2 line. The ear leaf tissues were frozen on dry ice in the field on collection.
The hybridizations for each species were performed in duplicate with fluor reversal. In all cases, total cellular RNA was then purified using an RNeasy Mini kit according to the manufacturer's instructions (Qiagen). For each species, competitive hybridizations were performed by mixing fluorescently labelled antisense RNA (cRNA; 5 µg) from each sample with the same amount of cRNA from a reference pool comprising equal amounts of cRNA from related samples (for more details, see Supplementary Information).
Probe selection for the gene expression arrays
The human microarray contained 24,479 non-control oligonucleotide probes for human genes as described previously18. The mouse microarray contained 23,574 non-control oligonucleotide probes for mouse genes and 2,186 control probes. Full-length mouse sequences were extracted from murine Unigene clusters, combined with RefSeq mouse sequences and RIKEN full-length sequences. The maize ear leaf microarray carried 24,473 non-control oligonucleotide probes for genes of interest and 1,287 control probes. More than 50,000 non-control probes were selected from maize and rice. These probes were used on ink-jet microarrays to collect expression data for RNA samples from a range of maize tissues. The top-ranking 24,473 probes were carried forward for use on the ear leaf array.
For all microarrays, to select a probe 60 nucleotides in length for each gene sequence, we used a series of filtering steps, taking into account repeat sequences, binding energies, base composition, distance from the 3' end, sequence complexity and potential cross-hybridization interactions7. All microarrays used in this study were custom ink-jet microarrays fabricated by Agilent Technologies.
Analysis of expression data
Array images were scanned using the Agilent Dual Laser Microarray scanner (Agilent Technologies) and processed as previously described21 to obtain background noise, single-channel intensity and associated measurement error estimates. The colour display in Fig. 1b shows log10 (expression ratio) as: (1) red when the red channel is upregulated relative to the green channel; (2) green when the red channel is downregulated relative to the green channel; (3) black when the log10 (expression ratio) is close to zero; and (4) grey when data from one or both of the channels for a given probe are unreliable.
Linkage and data analysis
Variance components analysis22 was used to estimate the heritability of gene expression, as measured by the mean log10 expression ratio, for each of the 2,726 mRNAs that were significantly differentially expressed in the founders, and to test whether the heritability was significantly different from zero. Genes were defined as differentially regulated if eight or more founders had a P-value for differential expression less than 0.05.
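The founder-based definition of differential regulation given above is a simple counting rule, sketched below. The function name, argument names and example P-values are hypothetical.

```python
def is_differentially_regulated(founder_pvalues, alpha=0.05, min_founders=8):
    """True if at least min_founders founders have a differential-expression
    P-value below alpha (the rule stated in the text: eight or more founders
    with P < 0.05)."""
    n_significant = sum(p < alpha for p in founder_pvalues)
    return n_significant >= min_founders

# Hypothetical P-values for the 16 founders of one gene.
example = [0.01] * 8 + [0.50] * 8
print(is_differentially_regulated(example))  # True
```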
A complete linkage map for all chromosomes except Y in mouse was constructed at an average density of 13 cM using microsatellite markers. Linkage maps were constructed and QTL analysis was performed using MapMaker QTL23 and QTL Cartographer24. Log of the odds ratio (LOD) scores were calculated at 2-cM intervals throughout the genome for each of the 23,574 genes represented on the mouse microarray.
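To make the notion of a LOD score for an expression trait concrete, the sketch below computes a single-marker LOD by comparing a model with one mean per genotype class against a model with a single overall mean, under normally distributed residuals. This is a deliberately simplified stand-in, not the interval mapping performed by MapMaker QTL or QTL Cartographer; the genotype labels and expression values are invented for illustration.

```python
import math
from collections import defaultdict

def single_marker_lod(genotypes, values):
    """LOD = (n/2) * log10(RSS_null / RSS_full), where RSS_null uses one
    overall mean and RSS_full uses a separate mean per genotype class.
    Requires some residual variation within genotype classes."""
    n = len(values)
    overall_mean = sum(values) / n
    rss_null = sum((v - overall_mean) ** 2 for v in values)

    by_genotype = defaultdict(list)
    for genotype, value in zip(genotypes, values):
        by_genotype[genotype].append(value)

    rss_full = 0.0
    for group in by_genotype.values():
        group_mean = sum(group) / len(group)
        rss_full += sum((v - group_mean) ** 2 for v in group)

    return (n / 2.0) * math.log10(rss_null / rss_full)

# Invented F2 genotypes at one marker (B = C57BL/6J allele, D = DBA/2J allele)
# and invented mean log10 expression ratios for one gene.
genotypes = ["BB", "BB", "BD", "BD", "DD", "DD"]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
print(round(single_marker_lod(genotypes, expression), 2))
```

Repeating such a calculation at each position along the genome, with genotype probabilities inferred between markers, is the general idea behind scanning at 2-cM intervals.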
A complete linkage map for all chromosomes in Z. mays was constructed at an average density of 12 cM using microsatellite markers. Linkage maps were constructed based on genotypes from the F2 progeny using MapMaker QTL. Mapping of QTL for gene expression traits was carried out as described previously25. For the interaction result presented in Fig. 4, individual specific LOD score vectors were calculated for each QTL, and correlations were computed between all vector pairs. See the Supplementary Information for more details on the segregation and linkage analyses performed over all species.
