Main

A number of variations of human DNA are likely to alter gene transcription rates. For instance, promoter and splice junction polymorphisms, which can change transcription rates1,2 and RNA stability,3 occur on average at 5.3 and 6.5 bases per kilobase of human DNA, respectively.4 Such DNA variations can also be associated with complex traits/disorders, either directly or indirectly through linkage disequilibrium with susceptibility loci. Therefore, differences in transcript abundance between series of cases and controls can indicate association with susceptibility loci, which rationalizes the use of differences in gene expression level as a surrogate marker for complex traits or disorders.5

The aim of the present meta-analysis was to evaluate whether complex disorder susceptibility loci show differences in gene expression between normal and affected tissues. This question was addressed by meta-analysis of all studies related to well-validated human gene-disease associations that have compared transcript amounts between series of normal and pathologic tissues. The present findings demonstrate that statistically significant differences in transcript levels of disease susceptibility genes are found between normal and pathologic human tissues. These results rationalize the use of comparative gene expression analysis for gene discovery studies. However, the differences in transcript amounts were relatively weak, which should be taken into account in the design of gene discovery studies based on gene expression.

METHODS

Literature search and inclusion criteria

To search for eligible studies, MEDLINE citations up to May 2003 were surveyed using the National Library of Medicine's PubMed online search engine. The search was limited to English language literature. To retrieve gene-disease associations, I combined the official symbol of the gene as defined by Human Gene Nomenclature or its alternative aliases, plus the name(s) of the disease, plus at least one of the keywords association, polymorphism, or allele. To select the gene expression studies, I combined the official symbol of the gene or its alternative aliases, plus the name of the disease, plus one of the keywords transcript, mRNA, RNA, or expression.
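The keyword combinations described above can be sketched as boolean query strings. This is only an illustration: the exact query syntax used for the original search is not reported, the helper names `or_group` and `build_query` are hypothetical, and the alias `TNFA` is given purely as an example of an alternative gene symbol.

```python
def or_group(terms):
    """Join terms with OR, quoting multi-word terms, and wrap in parentheses."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(gene_terms, disease_terms, keywords):
    """Combine gene symbols/aliases, disease names, and keywords with AND."""
    groups = (gene_terms, disease_terms, keywords)
    return " AND ".join(or_group(g) for g in groups)


# Keyword sets taken from the search description in the text.
ASSOCIATION_KEYWORDS = ["association", "polymorphism", "allele"]
EXPRESSION_KEYWORDS = ["transcript", "mRNA", "RNA", "expression"]

# Illustrative gene-disease pair; the alias list is an assumption.
association_query = build_query(["TNF", "TNFA"], ["asthma"], ASSOCIATION_KEYWORDS)
expression_query = build_query(["TNF", "TNFA"], ["asthma"], EXPRESSION_KEYWORDS)
print(association_query)
print(expression_query)
```

Grouping each category with OR and joining the categories with AND mirrors the "symbol or alias, plus disease, plus at least one keyword" rule stated in the text.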

The abstracts from the literature search were screened to select relevant studies. The articles were read, and all relevant references cited in these studies were recovered to identify additional works unidentified in the PubMed screen.

Selection of validated gene-disease associations

The analysis is based on the 166 gene-disease associations reported by Hirschhorn et al.8 as being tested at least three times in independent studies. These 166 gene-disease associations were related to 93 genes and 43 disorders (Fig. 1). That review included all published associations between common human diseases or dichotomous traits and common polymorphisms (defined as having a minor allele frequency over 0.01) located within or in the vicinity of the investigated genes. It excluded polymorphisms in HLA or blood groups, as well as association studies for substance abuse and laboratory findings. In addition, Hirschhorn et al.8 took into account only the associations between variation at a single locus and susceptibility to disease in the entire population.

Fig. 1

Flowchart of the data analysis used for this article. A, Analysis focused on the 166 gene-disease associations reported by Hirschhorn et al.8 as being tested at least three times in independent studies. These 166 gene-disease associations were related to 93 genes and 43 disorders. B, Then, I searched for gene-disease associations fulfilling the criteria reported by Lohmueller et al.9 as strongly predictive of replication in further allelic case-control association studies. Of the 166 gene-disease associations, 94, involving 74 genes and 33 disorders, fulfilled these criteria. C, Then, I searched for gene expression studies among these well-validated gene-disease associations. A total of 120 expression studies extracted from 57 articles were found. These expression studies involved 23 gene-disease associations, which were related to 20 genes and 8 disorders. D, These 120 gene expression studies provided the material used for the present statistical meta-analysis.

I next focused the analysis on gene-disease associations that have been well validated in the published literature. I therefore searched among the 166 gene-disease associations for those that fulfilled the criteria reported by Lohmueller et al.9 as strongly predictive of positive association in a sample of 25 gene-disease associations. These criteria were established from the observation that most of the 8 gene-disease associations supported by previously published meta-analysis met them, whereas most of the remaining 17 did not.9 The criteria are, in addition to the initial positive study achieving P < 0.05, at least two additional independent studies with P < 0.01 or a single study reaching P < 0.001. When more than one polymorphism in a gene had been studied, all polymorphisms corresponding to this gene were considered simultaneously. As a result, 94 gene-disease associations involving 74 genes and 33 disorders/traits fulfilled these conservative criteria (Fig. 1 and Table 1). Some of these genes were thus associated with two or more diseases (mean, 1.34; range, 1–5), such as TNF, which is involved in Alzheimer disease, obesity, type I diabetes, type II diabetes, and asthma. Interestingly, 33 out of these 74 genes (44.6%) were also associated with Mendelian-inherited diseases according to the Genatlas database. This strongly suggests that some genes can be responsible both for a rare Mendelian-inherited form of a disorder, caused by highly penetrant alleles, and for its much more common multigenic form, through less penetrant alleles. Although the molecular mechanisms associated with these differences of inheritance have not been investigated in detail, the changes underlying Mendelian-inherited disorders can reasonably be expected to have a more drastic effect on gene function than the variations predisposing to common multigenic forms of the disorder.
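The replication criteria stated above can be expressed as a small predicate. This is a minimal sketch under one stated assumption: the "single study reaching P < 0.001" is taken to be one of the follow-up studies, which the text does not specify. The function name `is_well_validated` and the example P values are hypothetical.

```python
def is_well_validated(initial_p, followup_ps):
    """Return True if an association meets the replication criteria.

    initial_p   -- P value of the initial positive study
    followup_ps -- P values of additional independent studies
    Assumption: the single P < 0.001 study is counted among the follow-ups.
    """
    if initial_p >= 0.05:
        return False
    n_below_001 = sum(1 for p in followup_ps if p < 0.01)
    has_below_0001 = any(p < 0.001 for p in followup_ps)
    return n_below_001 >= 2 or has_below_0001


# Hypothetical examples, not data from the reviewed studies.
print(is_well_validated(0.03, [0.008, 0.004, 0.2]))  # two follow-ups at P < 0.01
print(is_well_validated(0.03, [0.0005]))             # one follow-up at P < 0.001
print(is_well_validated(0.03, [0.02, 0.04]))         # no follow-up strong enough
```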

Table 1 Gene-disease associations

Gene expression studies

Subsequently, I searched for published gene expression studies among the 94 well-validated gene-disease associations (Fig. 1). All studies based on quantitative gene expression methods were eligible for analysis, including semiquantitative RT-PCR, competitive RT-PCR, real-time quantitative PCR, RT-PCR ELISA, branched DNA, cDNA or oligonucleotide microarrays, quantitative Northern-blot, dot-blot, RNase protection assay, serial analysis of gene expression (SAGE), and quantitative in situ hybridization. No relevant microarray data were found, so none were used in the present analysis. In some articles, several genes and/or methods of quantification were assayed simultaneously; these results were treated separately. Only the studies providing enough details to recalculate the distribution of expression in case and control groups were selected for analysis. In addition, the studies with fewer than five controls or five cases were excluded from analysis because of poor reliability (n = 7). In total, 120 expression studies extracted from 57 published articles were reviewed (Fig. 1). Some articles thus investigated the expression of several genes simultaneously. These 120 gene expression studies were related to 23 gene-disease associations, which correspond to 20 genes and 8 disorders (Table 2). Twelve out of these 23 gene-disease associations were screened for allelic association before gene expression studies, whereas 11 were screened for allelic association following gene expression studies. In 62 out of these 120 expression studies, the values of control and case groups were established from the figures of the original publication because details were lacking in the text of the article. In these cases, each value was assessed blindly at least three times using maximally magnified copies of the figures.

Table 2 Details of gene expression studies comparing normal and pathological human tissues for the 23 well-validated human susceptibility genes to complex disorders/traits (see Table 1)

Statistical analysis

For each gene expression study, the statistical difference between case and control groups was assessed by unpaired two-tailed Student's t-test using untransformed original variables. Statistical comparisons were considered significant at P < 0.05.
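As an illustration of the test described above, the following sketch computes an unpaired two-tailed Student's t-test (pooled variance) using only the Python standard library. It is not the SPSS procedure used in the study: the P value is obtained here by numerically integrating the t density with Simpson's rule, which is adequate for moderate P values but loses precision for extremely small ones. The function names and the example data are hypothetical.

```python
import math
import statistics


def student_t(cases, controls):
    """Pooled-variance t statistic (cases minus controls) and degrees of freedom."""
    n1, n2 = len(cases), len(controls)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * statistics.variance(cases)
                  + (n2 - 1) * statistics.variance(controls)) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.mean(cases) - statistics.mean(controls)) / se, df


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=2000):
    """Two-tailed P value: 1 - 2 * integral of the density from 0 to |t|."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


# Hypothetical expression values, not taken from any reviewed study.
t_stat, dof = student_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(t_stat, dof, two_tailed_p(t_stat, dof))
```

When only group means, standard deviations, and sample sizes are available, as is often the case for values read from published figures, the same pooled-variance formula applies with the summary statistics substituted for the raw data.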

To test the effects of various parameters on the gene expression ratio, a univariate analysis was performed by the linear model procedure using type III sums of squares. Multivariate analysis was performed by logistic regression with a forward stepwise search. In both univariate and multivariate analyses, the expression criterion used was whether the change in gene expression between case and control samples fell below or above a 2-fold threshold.

Statistical analyses were performed with the SPSS 11.5 software (SPSS Inc., Chicago, IL).

RESULTS AND DISCUSSION

The present meta-analysis aimed to determine whether well-validated human complex disorder susceptibility genes show differences in gene expression between normal and pathologic tissues. Therefore, the published literature was first screened to select well-validated gene-disease associations for complex human traits/disorders. The analysis was based on a set of 166 gene-disease associations reported by Hirschhorn et al.8 as being tested in at least three independent studies. Then, the published literature up to May 2003 was screened to select which of these 166 gene-disease associations fulfilled the criteria reported by Lohmueller et al.9 as strongly predictive of future replication. A total of 74 genes and 33 disorders involved in 94 gene-disease associations fulfilled these criteria (Fig. 1 and Table 1).

Then, these genetically well-validated genes were analyzed for differential transcript expression in pathologic conditions. In total, 120 expression records involved in 23 gene-disease associations were found (Fig. 1). For each study, the statistical difference of gene expression was calculated by unpaired two-tailed Student's t-test using the original data set (Table 2). Of these 120 gene expression studies, 60 (50%) achieved P < 0.05, which should be compared with the 6.0 significant results expected by chance under the null hypothesis of no gene expression change. This result was highly significant, as the value obtained by reference to the χ²-distribution was 513.4 (χ²-test, 1 degree of freedom; P < 10⁻¹¹²). In theory, publication bias toward positive results could explain such a difference. However, the number of published and unpublished nonsignificant studies required to account for these results would have to be at least 19-fold higher than the number of significant studies, i.e., 1140 studies (95% CI: 851–1428). Because there were only 60 nonsignificant published studies, one must postulate the existence of 1080 unpublished negative studies (95% CI: 791–1368), which appears frankly unrealistic. The ratio of statistically significant gene expression studies was similar between the genes also known to be involved in Mendelian-inherited disorders and those that are not (27 out of 58 vs. 31 out of 62; χ²-test, 1 degree of freedom: 0.4; P = 0.46). The ratio of statistically significant gene expression studies was also similar for the gene-disease associations tested by association before expression studies or after (20 out of 33 vs. 40 out of 87; χ²-test, 1 degree of freedom: 1.3; P = 0.15). Furthermore, out of the 60 studies leading to statistically significant results, 15 were first reports and 45 were replications of previous studies. Thirty-four of these replication reports showed a statistically significant difference in expression in agreement with the original report, whereas only 11 showed a statistically significant change in transcript levels in the opposite direction to that described in the first report. These figures are significantly different from the 22.5 studies expected to occur randomly in each of the same and opposite directions (95% CI: 19.0–26.0; χ²-test, 1 degree of freedom: 11.7; P = 6 × 10⁻⁴).
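The arithmetic behind several of the figures above can be retraced with a short sketch. It assumes a plain Pearson goodness-of-fit statistic with 1 degree of freedom and no continuity correction; the exact procedure used for the published values is not described, so small numerical differences are to be expected. Function names are hypothetical.

```python
import math
from fractions import Fraction


def chi2_gof(observed, expected):
    """Pearson goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_p_1df(x):
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


# 60 significant and 60 nonsignificant studies, against 6 and 114 expected
# when 5% of 120 studies are significant by chance.
chi2_all = chi2_gof([60, 60], [6, 114])
p_all = chi2_p_1df(chi2_all)

# 34 replications in the same direction and 11 in the opposite direction,
# against 22.5 expected in each direction.
chi2_dir = chi2_gof([34, 11], [22.5, 22.5])
p_dir = chi2_p_1df(chi2_dir)

# File-drawer arithmetic: if the 60 significant studies were all false
# positives at alpha = 0.05, nonsignificant studies would outnumber them 19:1.
alpha = Fraction(1, 20)
significant = 60
published_nonsignificant = 60
nonsignificant_needed = int(significant * (1 - alpha) / alpha)
unpublished_needed = nonsignificant_needed - published_nonsignificant

print(round(chi2_all, 1), p_all)
print(round(chi2_dir, 1), p_dir)
print(nonsignificant_needed, unpublished_needed)
```

Under these assumptions the first statistic comes out near 511.6, close to but not identical with the reported 513.4, while the direction test (about 11.8, P near 6 × 10⁻⁴) and the file-drawer counts (1140 and 1080) agree with the text.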

Several sources of bias, related to the heterogeneity of the expression studies analyzed here, can affect the results of the present meta-analysis. For instance, some susceptibility genes were tested by allelic association after gene expression analysis, while others were tested before. Although this was not found to affect the results of the present analysis, the frequency of gene expression changes could be overestimated in the subset of genes initially tested for gene expression. Furthermore, the present results could be altered by the fact that some gene expression results were not independent of one another, as several samples or genes were analyzed in the same studies. To test this hypothesis, the subset of 27 strictly independent gene expression studies was reanalyzed separately. Positive results were still obtained, as 24 out of the 27 gene expression studies (89%) were statistically significant, exceeding the number of 1.35 predicted by chance (P = 10⁻²⁸, binomial distribution). In summary, although the strength of the present statistical analysis supports the validity of the present findings, they should be interpreted cautiously and require further confirmation.
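The binomial probability quoted above can be checked with a short sketch. It assumes that each of the 27 independent studies has a 5% chance of reaching significance under the null hypothesis and computes the upper tail, i.e., the probability of 24 or more significant studies; whether the published value is a tail or a point probability is not stated. The function name is hypothetical.

```python
import math


def binomial_upper_tail(k, n, p):
    """Probability of observing k or more successes in n independent trials."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


n_studies = 27
alpha = 0.05
expected_by_chance = n_studies * alpha          # about 1.35 studies
p_tail = binomial_upper_tail(24, n_studies, alpha)

print(expected_by_chance)
print(p_tail)  # on the order of 1e-28
```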

However, the method based on comparing the expression level between series of cases and controls for human susceptibility genes shows a major limitation: only 36 (60.0%) and 19 (31.7%) out of the 60 statistically significant gene expression studies reached 2- or 3-fold changes in expression level, respectively (Fig. 2). In contrast, only 6 (10%) and 3 (5.0%) out of the 60 non–statistically significant studies reached these respective thresholds. Therefore, 2- or 3-fold change expression thresholds have in the present study a sensitivity of 50% and 23.5%, respectively. Consequently, the reliability of testing differences in mRNA abundance can be seriously affected by the weak differences in transcript abundance between cases and controls.

Fig. 2

Volcano plot of expression ratio against significance of gene expression studies between case and control samples. Each symbol indicates the results of a published gene expression study. Ratios of mRNA expression for each gene are reported as log₂ of the quotient of case series to control series. Statistical significance of expression between populations is expressed as the log₁₀ of the inverse P value from the unpaired two-tailed Student's t-test. Red squares represent studies with a change of expression over 2-fold (log₂ ratio > 1 or log₂ ratio < −1) that are not statistically significant (log₁₀(1/P) < 1.3); brown diamonds correspond to studies with expression change over the 2-fold cutoff that are statistically significant; blue triangles indicate studies with changes under 2-fold that are statistically significant; and green circles indicate studies with changes under 2-fold that are not statistically significant.
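The four categories of the volcano plot can be reproduced with a simple classification rule. This is a minimal sketch: the function name, the returned labels, and the example values are hypothetical, and significance is taken as P < 0.05, which corresponds approximately to the log₁₀(1/P) cutoff of 1.3 given in the caption.

```python
import math


def classify_study(case_mean, control_mean, p_value, fold=2.0, alpha=0.05):
    """Place a study in one of the four volcano-plot categories."""
    log2_ratio = math.log2(case_mean / control_mean)
    over_fold = abs(log2_ratio) > math.log2(fold)
    significant = p_value < alpha
    if over_fold and significant:
        return "over 2-fold, significant"
    if over_fold:
        return "over 2-fold, not significant"
    if significant:
        return "under 2-fold, significant"
    return "under 2-fold, not significant"


# Hypothetical studies given as (case mean, control mean, P value).
examples = [(30.0, 10.0, 0.001), (30.0, 10.0, 0.2),
            (15.0, 10.0, 0.01), (12.0, 10.0, 0.4)]
for case_mean, control_mean, p in examples:
    print(classify_study(case_mean, control_mean, p))
```

Taking the absolute value of the log₂ ratio treats a 2-fold decrease the same way as a 2-fold increase, as in the caption.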

To test the impact of various methodological parameters on gene expression change, I next performed a univariate ANOVA (Table 3). The gene (P = 0.003) and the nature of the tissue analyzed (P = 0.05) were associated with gene expression change between cases and controls, whereas the assay method used to measure transcript abundance had no significant effect (P = 0.08). Multivariate analysis by logistic regression with a forward stepwise search was then performed to assess the relative influence of these factors on the gene expression ratio. Both factors were found to be independently associated with the gene expression ratio, the gene having the strongest effect (P = 0.002), while the type of tissue used for expression analysis was more weakly associated (P = 0.01). Therefore, because of the influence of the gene itself and the tissue analyzed, the reliability of differential expression analysis is expected to vary strongly from one gene or one tissue to another.

Table 3 Univariate ANOVA analysis for the effect of various parameters on gene expression change

Taken together, the present results demonstrate significant differences in transcript levels of human susceptibility genes between normal and pathologic tissues. These results rationalize the use of comparative gene expression analysis for gene discovery studies. However, differences in transcript amounts appear much lower than those typically found between inbred, environmentally controlled animal models.6,7 These weak differences should be taken into account in the design of gene susceptibility studies that use differences in transcript amounts as a tool for gene discovery/validation.