The Pharmacogenomics Journal, 2001, Volume 1, Number 3, Pages 193-203
Original Article
Single nucleotide polymorphism identification in candidate gene systems of obesity

K Irizarry1,3, G Hu1,3, M-L Wong2, J Licinio2 and C J Lee1,3

1Dept of Chemistry and Biochemistry, University of California, Los Angeles, CA, USA
2Laboratory of Pharmacogenomics, Neuropsychiatric Institute, University of California, Los Angeles, CA, USA
3UCLA-DOE Laboratory for Structural Biology and Molecular Medicine, University of California, Los Angeles, CA, USA

Abstract
We have constructed a large panel of single nucleotide polymorphisms (SNPs) identified in 68 candidate genes for obesity. Our panel combines novel SNP identification methods based on EST data with public SNP data from large-scale genomic sequencing, to produce a total of 218 SNPs in the coding regions of obesity candidate genes, 178 SNPs in untranslated regions, and over 1000 intronic SNPs. These include new non-conservative amino acid changes in thyroid receptor beta, esterase D, and acid phosphatase 1. Our data show evidence of negative selection among these polymorphisms, implying functional impacts of the non-conservative mutations. Comparison of the overlap between SNPs identified independently from EST data vs genomic sequencing indicates that together they may constitute about one half of the actual total number of amino acid polymorphisms in these genes that are common in the human population (defined here as a population allele frequency above 5%). We have analyzed our polymorphism panel to construct a database of detailed information about their location in the gene structure and effect on protein coding, available on the web at http://www.bioinformatics.ucla.edu/snp/obesity. We believe this panel can serve as a valuable new resource for genetic and pharmacogenomic studies of the causes of obesity. The Pharmacogenomics Journal (2001) 1, 193-203.

Keywords
obesity; SNP; neuropeptides; body weight regulation; appetite; polymorphisms

INTRODUCTION
The draft sequence of the human genome has now been completed to 93% coverage, with an estimated total of 32000 genes.1,2 Of those, 22000 have been identified so far by stricter experimental criteria. This is likely to have profound and far-reaching impact on the genetics, pharmacogenetics, and pharmacogenomics of complex diseases. The most interesting question for investigators today is, where to start? While genome-wide single-nucleotide polymorphism (SNP) mapping has the potential to identify markers associated with treatment responses of common and complex disorders, use of this approach is still in its infancy and has not demonstrated success yet.3 Meanwhile, association approaches have examined associations of specific SNPs or small groups of SNPs to complex traits. With some notable exceptions, those efforts have also not yet borne fruit. An alternative approach would be to identify large batches of SNPs in candidate gene systems, that could be used for genomic and pharmacogenomic studies.4
Obesity provides a logical target for such an SNP discovery effort. Obesity is a common and complex disorder that represents the outcome of gene-environment interactions. In the general population of the US, the prevalence of overweight individuals (body mass index, BMI, between 25 and 29.9 kg m^-2) is 34.9% among women and 31.7% among men, and it more than doubles between the ages of 20 and 55. The incidence of obesity (BMI equal to or higher than 30 kg m^-2) has been conservatively estimated at 11%. Among women, obesity is strongly associated with socioeconomic status, being twice as common among those with lower socioeconomic status as it is among those with higher status. Weight gain has been described as epidemic among children and adolescents. The data from the Third National Health and Nutrition Examination Survey (NHANES III 1988-1994) indicate that approximately 14% of children and 12% of adolescents are overweight. Among adults, approximately 33% of men and 36% of women were overweight. Among women, 34% of non-Hispanic whites, 52% of non-Hispanic blacks, and 50% of Mexican Americans were overweight. Racial/ethnic group-specific variation among men was less than that among women.5 Obesity is also a well-recognized major risk factor for heart disease. The American Heart Association (AHA) has declared obesity a 'major risk factor', and an expert committee of the National Institutes of Health/National Heart Lung Blood Institute established obesity as an independent risk factor for heart disease based on a meta-analysis of over 300 prospectively-controlled randomized trials in the literature. A small fraction of obese patients have been identified who have specific Mendelian gene mutations that appear to cause obesity. These include mutations in the leptin, POMC, prohormone convertase 1, and melanocortin 4 receptor (MC4R) genes.6,7,8,9,10,11,12,13
The pharmacogenomics of obesity is a new and understudied area. While some treatments for obesity can have a positive outcome for a minority of patients, it is unknown why some patients respond to pharmacological and/or behavioral interventions, while others do not.14,15,16,17
Considerable progress has been achieved on the neurobiology of obesity. A variety of new and established candidate genes are now known to interact at the central level, particularly in the hypothalamus, to regulate food intake, energy expenditure and body weight. The interactions of those gene products and their role in human energy homeostasis have been one of the most exciting areas of biomedical investigation in the last decade (for reviews see Refs 18-24).18,19,20,21,22,23,24 The Human Obesity Gene Map provides an excellent resource of candidate genes that may be involved in human obesity.19 We have based our work on this extensive listing of candidate genes, to identify polymorphisms that can serve as useful tools for analyzing the genetic components of obesity.
We have constructed a large panel of single nucleotide polymorphisms (SNP) in these candidate obesity genes, combining novel SNP identification methods based on EST data with public SNP data from large-scale genomic sequencing. We believe this panel can serve as a valuable new resource for genetic and pharmacogenomic studies of the causes of obesity. It includes over 1400 SNPs in obesity candidate genes, as well as a database of detailed information about their location in the gene structure and effect on protein coding, which will be available on the web at http://www.bioinformatics.ucla.edu/snp/obesity.

RESULTS
Identification of SNPs in Obesity Candidate Genes
We based our SNP discovery effort on the Human Gene Glossary of the Human Obesity Gene Map (105 candidate obesity genes; http://www.obesity.chair.ulaval.ca/glossary.html), as a well-established set of genes for future genetic and pharmacogenetic studies of obesity.19 We were able to map 75 genes from this set to EST clusters from Unigene by matching gene identifiers. We used these 75 genes for EST-based SNP identification as described in this paper. This set included genes encoding neuropeptides and their receptors, transporters, enzymes and regulatory proteins involved in neurotransmitter synthesis, reuptake, and signaling, as well as metabolic enzymes.
To identify SNPs in this set of genes, we searched through the Unigene database of expressed sequence tags (EST) for overlapping cDNA sequence fragments of these genes from different people. We assembled a total of 9293 sequences (8942 ESTs, 222 mRNAs, and 129 DNAs) for this candidate gene set (Unigene release July 2000), representing 698 libraries. Since most of these libraries were prepared from a single individual's DNA, and a few from multiple individuals, these data represent sequence from more than 700 individuals. Sequences from about nine people on average were available for each gene. From this set of 75 clusters, 45 contained at least one SNP.
Based on this gene set, we identified 152 polymorphisms, including 76 cSNPs causing 51 amino acid changes (Table 1). We obtained the original sequencing chromatogram data for the ESTs, and performed a detailed analysis consisting of multiple sequence alignment of overlapping ESTs for each gene, removal of potential paralogous ESTs, and statistical scoring of candidate SNPs (see Methods). We selected candidate SNPs with a likelihood odds ratio greater than 10^3 in favor of a polymorphism as opposed to a sequencing error. We limited this study to candidate SNPs with population allele frequency greater than 1%, and at least one EST observation of the polymorphic band with good chromatogram quality (see Methods), or an observation in a curated mRNA sequence.
Our analysis also estimated population allele frequency for each SNP, reporting an expectation value and 95% rank confidence interval (see Table 2). Population allele frequencies for the obesity gene SNPs ranged from 3% to 37% in our data. We did not observe striking patterns within these allele frequencies. For example, allele frequencies for synonymous vs non-synonymous cSNPs did not show statistically significant differences. Unfortunately, our SNP data do not include any ethnicity information, since NIH has mandated that such information be stripped from public EST sequencing data. The dbEST libraries likely represent a cosmopolitan but primarily Caucasian population. Since SNP allele frequencies can vary widely across ethnic groups,25,26 this is a limitation and disadvantage of our dataset.
We have validated a significant fraction of these polymorphisms vs studies of independent human population samples. Candidate SNPs from our EST-based approach have been tested in 8-24 individuals from California (70% validation rate27) or Finland (80% validation rate) (G Hu, unpublished data). For these obesity gene SNPs, we were able to validate 25 of the 51 non-synonymous polymorphisms by comparison with amino acid mutations listed in databases such as SwissProt29 and OMIM. Since these databases by no means contain complete coverage of human polymorphism, 100% validation is not expected. We have also compared our EST-based SNPs for the candidate obesity genes vs polymorphisms independently identified by genomic sequencing.30 Overall, 39% of our obesity-gene SNPs were independently identified by public genomic sequencing, as reported in the dbSNP database.31
Characterization of Obesity Gene Polymorphisms
We have mapped these SNPs to the associated mRNA and protein sequences where possible. Overall, 50% mapped to the protein coding region (cSNPs) (Figure 1). This is dramatically higher than the frequency of cSNP identification from bulk genomic sequencing, which is typically only a few percent.30 We aligned 76 cSNPs to their corresponding amino acid positions in the protein sequences, representing 28 of the candidate obesity genes. The remaining 76 SNPs were in untranslated regions (UTR). We have catalogued the amino acid changes caused by cSNPs in these genes (Table 2).
The distribution of cSNPs in these genes shows evidence of functional selection (Figure 1c). Under a random mutation model, 79% of amino acid replacements should be non-conservative (vs 21% conservative). In our data, only 44% of the detected amino acid changes were non-conservative and 56% were conservative; the conservative fraction is thus more than double the 21% expected from random mutation. These data suggest negative selection has occurred against mutations that were more likely to be structurally disruptive.
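As a rough check on how unlikely this deficit of non-conservative changes would be under the random mutation model, the sketch below computes a one-sided binomial tail probability. It is not the authors' analysis: it assumes the 51 amino acid changes are independent draws, takes 22 of 51 as the non-conservative count (about 43%, close to the reported 44%), and uses the 79% expectation quoted above.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k + 1))

# Assumed inputs taken from the text: 22 non-conservative changes out of 51,
# against an expected non-conservative fraction of 0.79 under random mutation.
n_changes = 51
n_nonconservative = 22
p_random = 0.79

# Probability of seeing this few (or fewer) non-conservative changes by chance.
p_value = binom_cdf(n_nonconservative, n_changes, p_random)
print(f"one-sided binomial tail probability: {p_value:.2e}")
```

A very small tail probability here is consistent with, but does not by itself prove, selection against non-conservative replacements.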
Indeed, for our non-conservative SNPs, there is already substantial evidence of association with human diseases. Of the 22 non-conservative mutations we identified, a third (7 of 22) have already been reported in OMIM or SwissProt to be associated with diseases such as hypertension, coronary artery disease, plasma glucose levels, and body mass index (Table 3). While this is only evidence of correlation, not causation, this surprisingly high 'hit rate' increases the interest of this class of SNPs, especially since these conditions may be related to obesity. By contrast, we have found a reported disease association for only one of our conservative SNPs (ADRB2 Q27E). The observation that our non-conservative SNPs are more likely to be associated with disease suggests they may themselves have a functional effect.
Combining Obesity Gene SNPs from EST and Genomic Data
We have also combined our EST-based SNP identification results with public SNP data from genomic sequencing30 mapped to the candidate obesity gene set (Table 4). This yields a total of 218 cSNPs in 48 obesity candidate genes, and 178 UTR-SNPs. In addition, genomic sequencing has identified 1022 SNPs in the intronic regions of these genes. Although these polymorphisms are less likely to produce a functional effect, they can still be useful markers for mapping studies based on linkage disequilibrium.
These results demonstrate the highly enriched value of EST-based cSNP discovery, vs conventional SNP discovery from genomic sequencing (Figure 2). Out of the 1.4 million SNPs identified by genomic sequencing, only a few percent fall within genes.30 And of these, only a small fraction are in exons or protein coding regions, as can be seen clearly in the obesity gene set. Of the genomic SNPs mapped inside these genes by dbSNP, just 11% were in coding region and 8% in UTR exons; the rest were intronic (Figure 1b). By contrast, for our EST-based SNP discovery, these numbers were 50% and 50%, respectively. Moreover, despite the huge investment in SNP identification using genomic sequence, and the relatively small fraction of all SNPs that have been contributed by EST-based approaches (about 3.6% of the total), for coding region SNPs, ESTs can make a major contribution. The EST-based cSNPs described in this paper constitute more than a third (35%) of all cSNPs reported for these obesity candidate genes (Figure 2b). For SNPs in untranslated regions of exons (UTR), the corresponding fraction is also high (40%).
How complete is this polymorphism dataset? We have compared the overlap between our results and independent SNP identification from genomic sequencing, to assess the level of saturation of common SNPs by the public SNP identification efforts. Our data can be used as an independent estimator of the completeness of the public SNP data. We compared 148 of our obesity gene SNPs with genomic SNPs from the dbSNP database. Fifty-seven (39%) were independently identified by genomic sequence-based SNP discovery efforts, suggesting that the genomic SNP detection data cover about 40% of the common cSNPs in these genes. Thus, the combined EST- and genomic-based SNP data probably constitute a bit more than half of the total SNPs in these genes that are common in the human population (allele frequency above 5%). Based on the approximately 150 amino acid changes identified so far in the obesity genes, this would suggest a total of about 300 amino acid changes common in the human population, in these genes.
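The completeness argument above is simple arithmetic, which the following sketch reproduces. The function names are illustrative only, and the combined coverage of 0.5 is the rounded figure from the text rather than a derived quantity.

```python
def genomic_coverage(est_snps_compared, also_found_in_genomic):
    """Fraction of EST-derived SNPs independently rediscovered by genomic sequencing.

    Treating the EST-derived SNPs as an independent sample of common SNPs,
    this fraction estimates how much of the common-SNP pool the genomic data cover.
    """
    return also_found_in_genomic / est_snps_compared

def estimated_total(observed_count, combined_coverage):
    """Scale an observed count up to an estimated total, given fractional coverage."""
    return observed_count / combined_coverage

coverage = genomic_coverage(148, 57)        # figures quoted in the text
total_changes = estimated_total(150, 0.5)   # ~150 changes at roughly half coverage
print(f"genomic coverage: {coverage:.1%}; estimated common amino acid changes: {total_changes:.0f}")
```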
Accessing the Obesity Candidate SNP Database
All of these SNP data (both EST-based and genomic-based) are available as supplementary material online: http://www.bioinformatics.ucla.edu/snp/obesity. In addition, we have deposited our EST-based SNPs on the public dbSNP database.

DISCUSSION
These data provide a novel resource of single nucleotide polymorphisms relevant to obesity research. We have identified common SNPs in a wide range of genes that regulate metabolic rate and/or endocrine signaling, and a variety of other pathways identified in the Human Obesity Gene Map.19 These data include both a large number of SNPs in protein coding regions (220), and SNPs in untranslated regions of exons (177). Although much emphasis has been placed on SNPs that cause amino acid replacements, non-coding SNPs can also have important functional effects. Our dataset also contains more than 1000 intronic SNPs for these genes, which may be useful for mapping studies.
While many articles have addressed the issue of specific linkages to obesity phenotypes, this is to our knowledge the first attempt to identify a full set of over 1400 SNPs related to obesity, along with information relevant to their functional impact. We hope these data will serve as a public resource to the scientific community and facilitate future genetic and pharmacogenetic work in this area. Moreover, the full functional characterization of these and related SNPs, and their effects on food intake, energy metabolism, and body weight phenotypes, may represent a new direction for genomic research in obesity.
The dataset presented in this paper should be viewed mainly as a source of new SNPs for obesity research, not as a complete catalog of polymorphism for this field. There are several reasons for this. First, the goal of this work was not to incorporate known polymorphisms from the literature. Our dataset is undoubtedly missing many known SNPs that are of interest to obesity researchers, particularly those that are rare in the human population or restricted to particular ethnic groups. Second, the small population samples inherent in sequencing-based high-throughput SNP discovery efforts mean that they primarily detect common polymorphisms. In this study, the average number of individuals represented for a given gene was nine people, giving a poor chance of detecting a SNP whose allele frequency is below 5%. Third, the available data are by no means a complete catalog even of common polymorphisms. Comparison of our EST-based SNP detection with independent SNP detection based on genomic sequencing suggests that the total data probably comprise only about half of the SNPs that are common (allele frequency above 5%). Fourth, these data represent a cosmopolitan population pool without ethnicity information (due to NIH guidelines that forbid such information), and are undoubtedly missing many polymorphisms that are specific to a given ethnic group.
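To illustrate the second point, the sketch below estimates the chance that a variant allele is even present among the chromosomes sampled for a gene. It assumes unrelated diploid individuals and Hardy-Weinberg proportions, and it ignores the further loss from sequencing only a few ESTs per individual, so it gives an upper bound on detection rather than the study's actual sensitivity.

```python
def prob_allele_sampled(q, n_individuals):
    """Probability that at least one of 2*n sampled chromosomes carries the variant.

    q is the population allele frequency; each individual contributes two
    chromosomes, assumed to be independent draws.
    """
    return 1.0 - (1.0 - q) ** (2 * n_individuals)

# With nine individuals per gene (the average reported in the text):
for q in (0.01, 0.05, 0.20):
    print(f"q = {q:.2f}: P(allele present in sample) = {prob_allele_sampled(q, 9):.2f}")
```

For q = 0.05 and nine individuals this bound is about 0.60, and actual detection would be lower because a heterozygote's ESTs show the variant only about half the time.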
The genetic substrate of obesity appears to involve many genes, with a combination of smaller and larger effects. Therefore, a successful strategy to dissect the genetics and pharmacogenetics of this common and complex disorder may require correlating specific haplotypes across multiple genes or chromosomal regions. While our dense map of SNP markers will hopefully provide a useful starting point, there are many unsolved challenges remaining for identifying such multigenic obesity haplotypes.3

METHODS
Sequencing Chromatogram Analysis
Trace data of EST sequences in Unigene were obtained from Washington University (genome.wustl.edu), and processed with PHRED32,33 (generously provided by Phil Green, University of Washington) to produce base calls and quality factors.
To measure representative rates of sequencing error for single-pass reads of EST sequences from dbEST,34 we analyzed multiple sequence alignments of EST clusters, where sufficient data were present to yield a reliable consensus. Specifically, we limited our analysis to regions where at least five sequences were aligned and identical and no more than two other sequences disagreed with this consensus. Moreover, sequence positions which were farther than two residues to the left or right of a position that matches consensus were excluded from the analysis, since these can arise from chimeric or divergent sequences. Approximately 241 million nucleotides of aligned EST reads met these criteria, and were analyzed to give rates of sequencing error, categorizing each base as consensus, mismatch, or insert. Bases present in the consensus but missing in an individual read were counted as deletions.
To construct a statistical model for distinguishing sequencing error from likely SNPs, we constructed a joint probability model conditioned on two separate components: the observed quality factor for a given base, and the true (unobserved) local sequence context around that base. Although the quality factor in a given read is a function of that individual observation, and should be largely independent of the true sequence, we constructed a joint probability model that takes both into account without assuming conditional independence. Each observed base was reported within the context of five nucleotides of the consensus sequence, centered on the observed base, and its PHRED-assigned quality factor. Observations of all possible error states (no error; substitution by A/T/G/C/N; insertion of A/T/G/C/N; deletion) for all sequence contexts and quality factor combinations were counted. Approximately 241 million nucleotides of data were analyzed. Finally, the conditional error rates were smoothed and interpolated in such a way that they converge to the rate predicted by the PHRED quality score32,33 when the PHRED quality score is higher than 40.
Since trace data were not available for all ESTs, a second joint probability model was constructed, conditioning on the true sequence context and a weighted local miscall count: the number of 'unassigned' bases (ie bases called as 'N' instead of A, G, C or T) within 25 nucleotides of the position under consideration. The proximity of N indicates poor trace quality and is associated with increases in local error rate of up to 100 fold. To construct this model, 346 million nucleotides of EST data were analyzed.
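A minimal sketch of the local miscall count follows. The paper describes the count as weighted but does not give the weighting scheme here, so this version simply counts every 'N' within the window equally; the function name and the zero-based indexing are assumptions for illustration.

```python
def local_miscall_count(read, pos, window=25):
    """Count 'N' calls within `window` nucleotides on either side of `pos`.

    Unweighted simplification: every unassigned base in the window counts as 1.
    `pos` is a zero-based index into `read`; the base at `pos` itself is skipped.
    """
    lo = max(0, pos - window)
    hi = min(len(read), pos + window + 1)
    return sum(1 for k in range(lo, hi) if k != pos and read[k].upper() == "N")

# Hypothetical read with an N near the start and another N at the end.
read = "ACGTN" + "A" * 30 + "N"
print(local_miscall_count(read, 10))  # both N calls fall inside the window
print(local_miscall_count(read, 5))   # only the nearby N is inside the window
```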
SNP Identification
To identify likely SNPs, single base mismatches were reported from multiple sequence alignments produced by the programs PHRAP (P Green, http://genome.washington.edu) and POA41 for each Unigene cluster. Each mismatch was reported with seven nucleotides of the local consensus sequence context, centered on the mismatch, along with the corresponding segments of each sequence which align with this region. The PHRED quality score (or weighted miscall count) for each observed base was also reported.
To evaluate the strength of the evidence for a SNP at a given position in a gene, we calculated the likelihood-ratio of the observed sequences under a SNP model vs a pure sequencing error model:

For the sequencing error model we sum the probability of the observations over all possible 'true sequence' contexts T, to take into account any uncertainty about whether the consensus T* is actually the true sequence of the gene, which would undercut confidence in the SNP prediction:

The SNP model can be formulated in terms of the true sequence T of the gene (initially unknown), and an 'alternate sequence' T' differing only by a single nucleotide substitution at the central position:

Taking a conservative approach to the SNP model, of allowing only a specific polymorphism T' and the consensus sequence T* as the 'true sequence', we assume that p(T'|T*) = 1/3 and p(T) = constant over the possible gene sequences T, which cancels in the numerator and denominator of the likelihood ratio, giving:

Finally, ignoring the constant factor, we define a log-odds score for ranking candidate SNPs:
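The displayed equations for this derivation did not survive in this copy of the text. Based only on the definitions given in the surrounding sentences (error model summed over all true sequences T, SNP model restricted to the consensus T* and one alternate T', p(T'|T*) = 1/3, constant p(T) cancelling), a plausible reconstruction of the simplified ratio and the score is shown below. The base of the logarithm is an assumption, chosen to match the 10^3 odds threshold quoted in the Results.

```latex
\frac{p(\mathrm{obs}\mid \mathrm{SNP})}{p(\mathrm{obs}\mid \mathrm{error})}
  \;\approx\; \frac{1}{3}\,
  \frac{p(\mathrm{obs}\mid T^{*}, T')}{\sum_{T} p(\mathrm{obs}\mid T)},
\qquad
\mathrm{LOD} \;=\; \log_{10}
  \frac{p(\mathrm{obs}\mid T^{*}, T')}{\sum_{T} p(\mathrm{obs}\mid T)}
```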

To evaluate the sequencing error model, we treat each observed sequence i as an independent observation,

where the summation is over a Hidden Markov Model representing all possible alignments A of the observed sequence i to the true sequence T. The HMM sums the probability of all possible ways the observed sequence could have been emitted from the true sequence via a stochastic process of sequencing error. The match states (M) correspond to letters of the true sequence T, and deletion (D), insertion (I) and emission probabilities are derived from the observed frequencies of sequencing errors conditioned on local sequence context (a 5mer window of T) and the observed PHRED quality score. This corresponds to treating the proper alignment of the observed sequence to the true sequence as uncertain, and therefore summing over all possibilities. This sum is calculated using dynamic programming with the recursion relation

where o_i is the observed letter at position i, and t_j is a five-residue window from the true sequence T centered on position j, Q_i is the PHRED quality factor for position i in the observed sequence, the deletion term means deletion of the residue at position j in T, and insert o_i means insertion of o_i as an 'extra letter' immediately before position j in T. The probabilities p(o|t,Q) constitute the detailed model of sequencing error which we have produced for EST data as described above. T is treated as a seven-residue window centred on the putative SNP position, and from the observed sequence we take the window of residues that align to these seven residues. The p_ij are calculated over the two dimensional matrix i = 1...L_obs, j = 1...L_T, where L_obs and L_T (=7) are the lengths of the observed and true sequence windows, respectively. Using p_00 = 1 as the origin,
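The recursion described here is a forward (sum over all alignments) dynamic program. The sketch below shows the general shape of such a calculation under strongly simplified assumptions: fixed substitution, insertion and deletion rates stand in for the paper's error probabilities conditioned on five-residue context and PHRED quality, and the parameter values are placeholders rather than estimates from the EST data.

```python
def emission(o, t, p_sub):
    """Probability of reading letter o when the true letter is t (uniform miscalls)."""
    return 1.0 - p_sub if o == t else p_sub / 3.0

def forward_likelihood(obs, true, p_sub=0.01, p_ins=0.001, p_del=0.001):
    """Sum the probability of every alignment of `obs` to `true`.

    p[i][j] is the total probability of emitting the first i observed letters
    from the first j true letters, with p[0][0] = 1 as the origin.
    """
    n, m = len(obs), len(true)
    p_step = 1.0 - p_ins - p_del  # probability of a match/substitution step
    p = [[0.0] * (m + 1) for _ in range(n + 1)]
    p[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:   # match or substitution at (i, j)
                total += p[i - 1][j - 1] * p_step * emission(obs[i - 1], true[j - 1], p_sub)
            if j > 0:             # true letter j deleted from the read
                total += p[i][j - 1] * p_del
            if i > 0:             # observed letter i inserted as an extra letter
                total += p[i - 1][j] * p_ins * 0.25
            p[i][j] = total
    return p[n][m]

consensus = "ACGTACG"  # hypothetical seven-residue window
print(forward_likelihood("ACGTACG", consensus))  # read agrees with the window
print(forward_likelihood("ACGAACG", consensus))  # read with a central mismatch
```

Summing over alignments instead of fixing a single one means that a read with a nearby insertion or deletion still contributes a sensible, if small, likelihood.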

Secondly, we treat the true sequence itself as uncertain, and therefore sum p(obs|T) over all possible true sequences. Since candidate SNPs are evaluated within a 7-nucleotide window centered on the mismatch, this involves considering 4^7 = 16384 possible true sequences:
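The enumeration over true sequences can be sketched as follows. The per-read likelihood is passed in as a function, since the real one comes from the sequencing-error model; the `toy_likelihood` used in the example is a placeholder, and the constant prior p(T) is omitted because it cancels in the likelihood ratio.

```python
from itertools import product

def all_true_windows(length=7, alphabet="ACGT"):
    """Yield every possible true sequence window of the given length."""
    return ("".join(w) for w in product(alphabet, repeat=length))

def error_model_likelihood(obs_windows, read_likelihood, length=7):
    """Sum over all true windows T of the product over reads of p(obs_i | T).

    Reads are treated as independent observations given T.
    """
    total = 0.0
    for t in all_true_windows(length):
        like = 1.0
        for obs in obs_windows:
            like *= read_likelihood(obs, t)
        total += like
    return total

n_windows = sum(1 for _ in all_true_windows())
print(n_windows)  # number of candidate true sequences for a 7-residue window

# Placeholder likelihood: a read is only explained by an identical true window.
toy_likelihood = lambda obs, t: 1.0 if obs == t else 0.0
toy_total = error_model_likelihood(["ACGTACG", "ACGTACG"], toy_likelihood)
```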

The SNP model p(obs|T*,T') differs from the error model first of all at the central position, where a simple weighted mixture of the consensus sequence T* and the putative SNP T' is used. Thus, at the central position of the HMM

where q is the expected allele frequency of T'.
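The displayed mixture equation is missing from this copy. From the description of a simple weighted mixture of the consensus and the putative SNP with weight q, a plausible reconstruction is the following, where t* and t' denote the sequence contexts of T* and T' at the central position (notation assumed, following the symbols defined for the recursion):

```latex
p(o_i \mid \text{central position})
  \;=\; (1 - q)\, p(o_i \mid t^{*}, Q_i) \;+\; q\, p(o_i \mid t', Q_i)
```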
Treating the ESTs as independent observations, we would calculate

cDNA Library Information
However, under the SNP model the EST observations are not independent, because they are drawn from a small number of cDNA libraries each representing (typically) one human individual. An important discriminant of genuine SNPs vs sequencing error can be obtained from the pattern of occurrence of the putative SNP over distinct library sources. Whereas sequencing errors should assort randomly over the set of libraries, a genuine SNP should follow a pattern consistent with diploid genetics. Specifically, the SNP should be observed in approximately 50% of the sequences from a heterozygote library, and in 100% of the sequences from individuals homozygous for the rare allele. Thus, the frequency expected from a positive library (50%) is typically much greater than the allele frequency q in the general population. In this way library information for the observed sequences can be a powerful discriminant of real SNPs from the high background of sequencing error.
We have combined this library information with a Bayesian approach to estimate the true population allele frequency q of an SNP, subject to the quantity and quality of sequence data available for detecting it. The probability of an individual sequence, conditional on the zygosity value for its library z = 0, 1, 2 (copy number of the polymorphism in the individual from whom the DNA was derived) is

where qz = z/2 and is used in place of the allele frequency q in the HMM for the central position, above. Thus for a heterozygote library (z = 1) we use qz = 0.5. For multiple observations from one library L, we sum over uncertainty about the true value of z in the library:

This allows us to introduce a dependence on the population allele frequency q assuming Hardy-Weinberg equilibrium
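The displayed equation is missing from this copy. Under Hardy-Weinberg equilibrium the zygosity of a diploid library follows a binomial distribution with two draws, which is consistent with the later remark that pooled libraries use the same binomial form with P in place of 2. A reconstruction on that basis:

```latex
p(z \mid q) \;=\; \binom{2}{z}\, q^{z}\, (1 - q)^{2 - z}, \qquad z = 0, 1, 2
```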

For observations from multiple libraries, we substitute this expression for p(z) in the previous equation

Integrating over q,

In this paper we use the uninformative prior p(q) = 1 for all q. The p(q) distribution will be subject to more in depth analysis in follow-up works. The posterior distribution for q is therefore

One additional effect must be considered. Some libraries were constructed by pooling cDNAs from more than one individual. In this case we define the ploidy of the library P = 2N, where N is the number of people from whom DNA was pooled. Then z can take values z = 0, 1, 2, ..., P. Also, q_z = z/P, and p(z|q) is simply the binomial distribution, substituting P for 2 in the equation for p(z|q) above. We obtained detailed information on library construction and pooling for 850 libraries used in dbEST, and determined P for all 35 libraries annotated as pooled.
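The library-aware Bayesian estimate described in this section can be sketched numerically as below. This is an illustration under stated assumptions, not the authors' code: each read is summarized by a hypothetical pair (p_ref, p_alt) giving its probability under the consensus and under the variant allele, which in the paper would come from the sequencing-error HMM, and the integral over q is approximated on a uniform grid with the flat prior p(q) = 1.

```python
from math import comb

def p_z_given_q(z, q, ploidy=2):
    """Binomial probability of z variant copies in a library (Hardy-Weinberg for ploidy 2)."""
    return comb(ploidy, z) * q**z * (1.0 - q) ** (ploidy - z)

def library_likelihood(reads, q, ploidy=2):
    """p(obs_L | q): sum over the unknown zygosity z of one library.

    reads is a list of (p_ref, p_alt) pairs; within a library with zygosity z
    each read is a mixture with weight q_z = z / ploidy on the variant allele.
    """
    total = 0.0
    for z in range(ploidy + 1):
        qz = z / ploidy
        like = 1.0
        for p_ref, p_alt in reads:
            like *= (1.0 - qz) * p_ref + qz * p_alt
        total += p_z_given_q(z, q, ploidy) * like
    return total

def posterior_q(libraries, grid_size=1001):
    """Posterior over q on a uniform grid, using the flat prior p(q) = 1."""
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    weights = []
    for q in grid:
        w = 1.0
        for reads, ploidy in libraries:
            w *= library_likelihood(reads, q, ploidy)
        weights.append(w)
    norm = sum(weights)
    return grid, [w / norm for w in weights]

def posterior_mean(grid, post):
    """Expectation value of q under the gridded posterior."""
    return sum(q * p for q, p in zip(grid, post))

# Hypothetical data: eight single-individual libraries showing only the consensus,
# plus one library with one variant read and one consensus read (a likely heterozygote).
ref_read, alt_read = (0.99, 0.001), (0.001, 0.99)
ref_only_libs = [([ref_read], 2) for _ in range(8)]
example_libs = ref_only_libs + [([alt_read, ref_read], 2)]

grid, post = posterior_q(example_libs)
mean_with_variant = posterior_mean(grid, post)
mean_ref_only = posterior_mean(*posterior_q(ref_only_libs))
print(f"posterior mean q with variant library: {mean_with_variant:.3f}")
```

A pooled library is handled by passing its ploidy P = 2N in place of 2, which switches p(z|q) to the corresponding binomial over z = 0, ..., P.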
DUALITY OF INTEREST
None declared.
ABBREVIATIONS
SNP - Single Nucleotide Polymorphism
cSNP - coding region Single Nucleotide Polymorphism
eSNP - expressed Single Nucleotide Polymorphism
UTR - untranslated region

Acknowledgements
This work was supported by the following grants and awards: Department of Energy grant DEFG0387ER60615 to CL, Searle Scholar Award to CL, NIH grants K30HL0426, R01DK58851, U01GM61394 to JL, Dana Foundation award to JL, Stanley Foundation award to JL, NIH grants P50AT00151-020002 and R21MH/NS62777 to M-LW, NARSAD award to M-LW, support for KI from USPHS National Research Service Award GM08375, and support for KI from NSF IGERT award number 9987641.

References

Figures

Figure 1 Characterization of SNP effects on protein coding.
Figure 2 Contribution of EST-based approaches to coding region SNP discovery.

Tables

Table 1 Obesity gene SNP discovery based on EST data
Table 2 Amino acid changes in genes believed to be involved in obesity
Table 3 Known disease associations for obesity gene cSNPs
Table 4 Total combined SNPs from EST data and from public genomic sequencing (HGBASE, SNP Consortium), for the obesity candidate gene set

Received 2 March 2001; revised 17 July 2001; accepted 18 July 2001