Abstract
To facilitate association studies of complex diseases characterized by hyperhomocysteinemia, we collected structural and frequency data on single-nucleotide polymorphisms (SNPs) in 24 genes relating to homocysteine metabolism. First, we scanned ∼1.2 Mbp of sequence in the NCBI SNP database (dbSNP) build 110 and detected 1353 putative SNPs with a median in silico genic density of 1:683. Out of 112 putative SNPs in coding regions (cSNPs), we selected a subset of 42 cSNPs and assessed the applicability of the NCBI dbSNP to the Czech population (a typical representative of European Caucasians) by determining the frequency of the putative cSNPs experimentally by PCR-RFLP or ARMS-PCR in at least 110 control Czech chromosomes. As only 25 of the 42 analyzed cSNPs met the criterion of ≥1% frequency, the positive predictive value of the NCBI data set for our population reached 60%, which is similar to other studies. The correlation of SNP frequency between Czechs and other Caucasians (obtained from NCBI and/or literature) was stronger (r2=0.90 for 20 cSNPs) than between Czechs and general NCBI database entries (r2=0.73 for 27 cSNPs). Moreover, the frequencies of all 20 putative cSNPs for which data in Caucasians were available were congruently below or above the 1% frequency criterion both in Czechs and in other Caucasians. In summary, our study shows that the NCBI dbSNP is a useful tool for selecting cSNPs for genetic studies of hyperhomocysteinemia in European populations, although experimental validation of SNPs should be performed, especially if the cSNP entry lacks any frequency data in Caucasians.
Introduction
Homocysteine is a thiol-containing amino acid, which occupies a key position in the metabolism of one-carbon units and of sulfur compounds. Many clinical studies have revealed an association of elevated plasma homocysteine levels with an increased risk of cardiovascular disease1,2 or of other conditions.3 These studies, however, do not prove causality as they merely demonstrate an epidemiological correlation. Homocysteine metabolism is in part determined by genetic variants, which are fixed at conception and which do not typically change throughout life. Assuming Mendelian randomization, any observed association of these genetic factors with disease would suggest that the respective allelic variants are etiologically related to disease. Although association studies require that suitable genetic markers exist, a comprehensive list of such genetic variants in the field of homocysteine research is not available.
Polymorphism is defined as a heritable DNA change occurring in at least 1% of alleles; variants with frequency higher than 10% are considered common polymorphisms. Single-nucleotide polymorphisms (SNPs) represent the most frequent type of polymorphism in the human population and may be useful in association studies, as they may themselves be functionally relevant and/or may be in linkage disequilibrium with other variants that have a functional effect. The number of discovered SNPs has increased tremendously over the past few years. SNPs are present in different parts of the human genome; variations in coding regions, together with changes in regulatory regions, are believed to have the highest impact on phenotype.
Public SNP databases (dbSNPs) are a highly valuable resource of information about polymorphisms in candidate genes. At present, several dbSNPs exist in the public domain; their SNP content overlaps significantly, but the databases also complement one another.4 The dbSNP of the National Center for Biotechnology Information (NCBI) is one of the central repositories for newly discovered genomic and cDNA sequence variations, both single base changes and short deletions and insertions.5 In this dbSNP, almost six million unique SNPs had been deposited as of November 2003 (dbSNP build 117). The quality of database entries was evaluated in several studies employing positive predictive value (ie the probability that a putative SNP entry in a database is indeed a true polymorphism for a given population, with frequency of the rare allele higher than 1%) and sensitivity (ie the probability that an existing SNP is deposited in the database).6,7,8,9 The above studies analyzed samples of mixed ethnic origin,6,7,8,9 and to our knowledge, the effect of ethnicity on the predictive value of SNP databases has been evaluated in only a few studies.10,11,12 It is also important to note that the above-mentioned reports examined genes that were otherwise not a subject of intense research in clinical samples, which may have caused the rather low sensitivity of dbSNP in one of these studies.6
The aims of our study were (a) to collect all available information on SNPs in 24 genes relating to homocysteine metabolism (either directly in the methionine cycle or indirectly in the metabolism of vitamins) and (b) to assess the applicability of database entries to a typical Caucasian population from Central Europe. The applicability of the database was evaluated for a subset of 42 putative SNPs in seven genes of folate and homocysteine metabolism by calculating the positive predictive value after determining the population frequency by PCR-RFLP or ARMS-PCR in at least 110 control Czech chromosomes.
Methods
SNP data mining from database
The SNPs in 24 genes relating to homocysteine metabolism were searched at the NCBI web page as of January 2003 (build 110); detailed information on the analyzed genes is given in Table 1. The in silico search was based on gene name or symbol; the candidate SNPs were manually localized to the 5′UTR, introns, exons and 3′UTR of the particular gene using the GenBank reference sequence and the recommended numbering starting with adenosine in the first ATG. The use of this numbering system led to discrepancies with the numbering of some previously published SNPs (eg c.677C>T, c.1298A>C and c.1305C>T in the MTHFR gene). To collect recent data on individual SNPs for Table 2, we updated the frequencies using build 117 (November 2003) of the NCBI dbSNP.
Literature searches
To collect recent data on SNPs and their frequencies, we also explored the literature, using Medline searches with specific gene names to identify the relevant studies published as of November 2003. In addition, data from conference proceedings were used for completing the list of known polymorphisms.
Genotyping and determination of frequency in the Czech population
To evaluate the positive predictive value of dbSNP, we selected all 42 SNPs available in build 110 of dbSNP (as of January 2003) that were localized in the coding regions of seven genes relating directly to homocysteine metabolism. The frequency of additional SNPs arising from dbSNP build 117 (as of November 2003) was not determined. The frequency of the selected 42 cSNPs was estimated experimentally in the Czech population using PCR-RFLP or ARMS-PCR with allele-specific primer pairs (see Table I in web supplement). Quality control of each batch of samples was ensured by (i) the presence of an additional internal restriction site, (ii) complete cleavage of the wild-type PCR product for SNPs that destroy a naturally occurring restriction site, (iii) including samples with known genotype or (iv) using a different PCR product containing the restriction site as an external control (for details see Table I in web supplement). Samples of genomic DNA from healthy controls aged between 18 and 65 years from a homogeneous Caucasian population in the Czech Republic were used;28 at least 110 alleles (range 110–1194 alleles, median 300 alleles) were examined for the presence of each variant. The frequency of each SNP was determined by counting the number of chromosomes carrying and lacking the variant.
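The chromosome-counting step described above can be illustrated with a minimal sketch. It assumes a biallelic autosomal site, so each genotyped individual contributes two chromosomes; the function name and the genotype counts are hypothetical and are not taken from the study.

```python
def allele_frequency(n_hom_ref, n_het, n_hom_var):
    """Return (variant allele frequency, number of chromosomes examined).

    Assumes a biallelic autosomal site: every individual carries two
    chromosomes, heterozygotes carry one variant allele and variant
    homozygotes carry two.
    """
    n_chromosomes = 2 * (n_hom_ref + n_het + n_hom_var)
    if n_chromosomes == 0:
        raise ValueError("no genotypes supplied")
    n_variant = n_het + 2 * n_hom_var
    return n_variant / n_chromosomes, n_chromosomes


# Hypothetical genotype counts for one cSNP in 150 controls (illustration only)
freq, n = allele_frequency(n_hom_ref=120, n_het=27, n_hom_var=3)
print(f"{n} chromosomes examined, variant allele frequency = {freq:.3f}")
```

With these made-up counts, 33 of 300 chromosomes carry the variant, so the site would be classified as polymorphic under the 1% criterion.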
Positive predictive value of database subset for Czech population
Positive predictive value was calculated in a subset of 42 SNPs as a ratio of the number of true polymorphisms (with frequency of the rare allele higher than 1%) to the total number of the putative SNPs that were found by in silico searches. Correlation was calculated using Prophet 5.0 software (BBN Systems and Technologies).
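As a minimal sketch of this ratio, the snippet below counts the putative SNPs whose measured rare-allele frequency reaches a threshold and divides by the number of entries tested. The function name is hypothetical, the cut-off is written as ≥1% following the abstract, and the frequency values are placeholders chosen only to reproduce the reported counts (25 confirmed out of 42); they are not the measured data.

```python
def positive_predictive_value(rare_allele_freqs, threshold=0.01):
    """Fraction of putative SNPs whose measured rare-allele frequency
    reaches the threshold (assumed here to be >= 1%)."""
    if not rare_allele_freqs:
        raise ValueError("no SNPs supplied")
    confirmed = sum(1 for f in rare_allele_freqs if f >= threshold)
    return confirmed / len(rare_allele_freqs)


# Placeholder frequencies: 25 entries above the threshold, 17 below it
measured = [0.05] * 25 + [0.0] * 17
ppv = positive_predictive_value(measured)
print(f"PPV = {ppv:.1%}")  # 25/42, which rounds to the reported 60%
```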
Results
In this study, we collected information on SNPs in 24 genes relating directly or indirectly to homocysteine metabolism. First, by in silico analysis, we scanned almost 1200 kbp of sequence in the NCBI database (build 110) and we detected 1353 putative SNP DNA variations, of which 85 were contained in the coding regions. The SNP density varied considerably for individual genes reaching a median of 1:683 for the genic regions and 1:412 for the coding regions (for details see Table II in web supplement). The median SNP densities in genes relevant to homocysteine metabolism are similar to the published estimates of 1:567 for the entire genome (dbSNP Summary build 117, as of November 2003).
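The difference between a median of per-gene densities and a pooled density can be shown with a short sketch. The gene names, lengths and SNP counts below are hypothetical placeholders, not values from Table 1 or the web supplement. Densities are expressed as base pairs per SNP, so 1:683 corresponds to 683. The two summaries need not agree: dividing roughly 1.2 Mbp by 1353 SNPs would give about one SNP per 890 bp, whereas the reported median of per-gene values is 1:683.

```python
from statistics import median

# Hypothetical genes: (scanned length in bp, number of putative SNPs)
genes = {
    "GENE_A": (20_000, 40),
    "GENE_B": (50_000, 50),
    "GENE_C": (10_000, 5),
}

# Per-gene density as base pairs per SNP
bp_per_snp = {name: length / n_snps for name, (length, n_snps) in genes.items()}

# Median of the per-gene densities
median_density = median(bp_per_snp.values())

# Pooled density: total length divided by total SNP count
total_bp = sum(length for length, _ in genes.values())
total_snps = sum(n_snps for _, n_snps in genes.values())
pooled_density = total_bp / total_snps

print(f"median density 1:{median_density:.0f}, pooled density 1:{pooled_density:.0f}")
```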
As other researchers may use data on polymorphisms in genes relating to homocysteine metabolism in their genetic studies, we collected data on additional cSNPs that were not subject to the experimental validation described below, and we also updated cSNP frequencies from all available sources as of November 2003 (including literature and dbSNP build 117). Table 2 shows data on 112 putative or confirmed cSNPs; an experimentally determined frequency of the rare allele was available for 47 entries in non-Caucasian/mixed samples and for 67 entries in Caucasian samples.
To evaluate the applicability of the NCBI database to the Czech population, we selected a subset of 42 putative dbSNP entries in seven genes of folate and homocysteine metabolism for experimental validation (see Figure 1). As the first step in assessing the positive predictive value of this NCBI database subset for our population, we determined the frequency of all 42 cSNPs in at least 110 Czech control chromosomes using PCR-RFLP or ARMS-PCR (frequencies are given in Table 2). We then examined whether each of the putative database SNP entries met the definition criterion, that is, a frequency of the rare allele at a locus higher than 1%. As only 25 of the 42 putative cSNPs met this criterion while the remaining 17 variants were false positives, the positive predictive value of this NCBI SNP subset for the studied Czech population is 60%. Consequently, the median density of experimentally validated cSNPs (ie 1:950) is about half of that predicted from in silico searches (ie 1:412, for details see Table II in web supplement), which corresponds well to other studies.29 Interestingly, the false-positive cSNP entries either were rare variants in the NCBI database (eight entries with frequency <3% in mixed samples) or had no frequency available in the dbSNP (nine entries). These data suggest that dbSNP entries with low or missing frequency are more likely to be false positives in Caucasians.
It is possible that the failure of the NCBI database to predict some cSNPs in the Czech population is a consequence of largely different SNP frequencies in the samples used to create the NCBI entries. To test this hypothesis, we examined the effect of ethnicity on frequency estimates. Of the 42 analyzed cSNPs, the NCBI database contained frequency information in non-Caucasian/mixed populations for 27 entries and in Caucasians for nine entries. In addition, the literature contained frequency data on 15 cSNPs for several European populations. The correlation of SNP frequencies between Czechs and other Caucasians (r2=0.90, P=0.0001, Figure 2b) was substantially stronger than between Czech controls and the general NCBI data set (see Figure 2a, r2=0.73, P=0.0001). Moreover, the frequencies of all 20 putative cSNPs for which data in Caucasians were available were congruently below or above the 1% frequency threshold both in the Czech population and in other Caucasians. In summary, these data suggest that for genes relating to homocysteine metabolism the cSNPs validated in one Caucasian population may be truly polymorphic in other Caucasians.
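The r2 values above were obtained by the authors with Prophet 5.0. As a stand-in, the sketch below computes the squared Pearson correlation coefficient in plain Python; that the reported r2 is a squared Pearson coefficient is an assumption, and the paired frequencies are hypothetical placeholders rather than values from Table 2.

```python
def r_squared(x, y):
    """Squared Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return s_xy ** 2 / (s_xx * s_yy)


# Hypothetical rare-allele frequencies for the same cSNPs in two populations
czech = [0.02, 0.10, 0.25, 0.33, 0.45]
other = [0.03, 0.12, 0.22, 0.35, 0.41]
print(f"r2 = {r_squared(czech, other):.2f}")
```

A value close to 1 indicates that the frequencies in one population are nearly a linear function of those in the other; it does not by itself show that the frequencies are equal.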
Discussion
To assess the applicability of dbSNPs in the public domain to a European population, we collected and evaluated allele frequency data for 42 variant alleles relating to homocysteine metabolism. The positive predictive value of the NCBI data set for a typical Caucasian population was 60%, which is intermediate between the values reported by Cox et al6 and Reich et al.9 Cox et al found that 55% of polymorphisms detected in silico in coding sequence were confirmed experimentally, while Reich et al confirmed by independent resequencing over 88% of SNPs that were available in three different public and commercial databases. In summary, our study suggests that about two-thirds of NCBI dbSNP entries may be truly polymorphic in European populations, which corresponds very well to the conclusions of Marth et al ‘if a researcher uses the publicly available candidate SNPs for a study in a population, there is only a 66–70% chance that the SNPs have appreciable minor allele frequency’.7
The applicability of dbSNPs to the study of genetic variants in specific populations may be obscured by the presence of false-positive entries, which may constitute about one-third of database data.7 False positivity may arise from errors at two steps: generation of the database entry, or validation of the SNP in a given population sample. At the step of entry generation, false positives may be produced by technical problems such as sequencing errors, by errors during the computational data mining procedure,6 or by analysis of patient samples and misclassification of pathogenic mutations as SNPs.5 When the frequency of putative SNPs is validated in a population sample, false positivity may originate from insufficiently sensitive detection methods or insufficient sample size. In our study, genotyping errors were unlikely as we employed quality control. Moreover, we screened at least 300 alleles for SNPs appearing as monomorphic, which gave us a power of 95.1% to classify them as truly false positive. Together, these data strongly suggest that these putative SNPs are indeed absent in the studied Czech population sample. Another, and the most likely, source of false positivity is a difference in ethnicity between the samples from which the respective entry was generated and the samples of the studied population. Indeed, the comparison of cSNP frequency data between Czechs on one side and other Caucasians or non-Caucasian/mixed samples on the other shows that SNP frequencies are less well correlated between unrelated populations than between closely related populations. The larger distance between Czechs and the general NCBI data pool corresponds well to the observations of others,10 who showed that frequencies correlated more strongly between Koreans and other Asians than between Koreans and the general NCBI data set.
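The 95.1% power quoted in the preceding paragraph for screening 300 alleles is consistent with a simple binomial argument: if the true rare-allele frequency were 1% and alleles were sampled independently, the probability of observing at least one variant allele among n chromosomes is 1 − (1 − 0.01)^n. The sketch below evaluates this expression; that the authors derived their figure this way is an assumption, and the function name is hypothetical.

```python
def detection_power(minor_allele_freq, n_alleles):
    """Probability of seeing at least one variant allele among n_alleles
    independently sampled chromosomes, given the true allele frequency."""
    if not 0.0 <= minor_allele_freq <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    return 1.0 - (1.0 - minor_allele_freq) ** n_alleles


power_300 = detection_power(0.01, 300)
power_110 = detection_power(0.01, 110)
print(f"power with 300 alleles: {power_300:.3f}")
print(f"power with 110 alleles: {power_110:.3f}")
```

Under the same assumptions, the minimum sample of 110 alleles would detect a 1% variant considerably less often than 300 alleles would, which illustrates why a larger number of chromosomes is useful for entries that appear monomorphic.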
The databases may not contain all existing SNPs, which is reflected in another characteristic of a database, namely its sensitivity. The study of Cox et al6 suggested that the sensitivity of the database may be quite low, as the in silico search detected only 27% of the polymorphisms that were discovered experimentally in that study. Since we did not sequence the seven genes of interest in multiple control samples, we were unable to evaluate the sensitivity of dbSNP accurately. However, these genes have been systematically analyzed by other researchers in numerous clinical samples obtained from patients with disturbed homocysteine metabolism. Indeed, by searching the literature and by our own experimental work, we were able to find only three additional cSNPs that were lacking in the NCBI dbSNP (they were discovered experimentally by sequencing clinical samples). In summary, it is conceivable that most of the genetic variation in the coding regions of these seven genes has already been detected owing to the systematic analysis of these genes by the community of homocysteine researchers.
In our study, we collected structural and frequency data on polymorphisms in selected genes relating to sulfur amino-acid metabolism. This set of data suggests that about two-thirds of SNPs found in the database are indeed polymorphic in our population, and that the majority of existing cSNPs in genes relating to homocysteine metabolism have already been deposited in the NCBI database. However, the data from our study should be interpreted with caution, as the number of genes was quite small and the results may be confounded by the particular interest of the scientific community in these selected genes. Nevertheless, our study shows that the NCBI dbSNP is a valuable tool for selecting markers for genetic studies, and that experimental validation of cSNPs should be performed, especially if the reported frequency of the candidate polymorphism is low or lacking.
References
Brattstrom L, Wilcken DE : Homocysteine and cardiovascular disease: cause or effect? Am J Clin Nutr 2000; 72: 315–323.
Ueland PM, Refsum H, Beresford SA, Vollset SE : The controversy over homocysteine and cardiovascular risk. Am J Clin Nutr 2000; 72: 324–332.
Carmel R, Jacobsen D : Homocysteine in Health and Disease. Cambridge: Cambridge University Press, 2001.
Aerts J, Wetzels Y, Cohen N, Aerssens J : Data mining of public SNP databases for the selection of intragenic SNPs. Hum Mutat 2002; 20: 162–173.
Sherry ST, Ward MH, Kholodov M et al: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001; 29: 308–311.
Cox D, Boillot C, Canzian F : Data mining: efficiency of using sequence databases for polymorphism discovery. Hum Mutat 2001; 17: 141–150.
Marth G, Yeh R, Minton M et al: Single-nucleotide polymorphisms in the public domain: how useful are they? Nat Genet 2001; 27: 371–372.
Sachidanandam R, Weissman D, Schmidt SC et al: A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 2001; 409: 928–933.
Reich DE, Gabriel SB, Altshuler D : Quality and completeness of SNP databases. Nat Genet 2003; 33: 457–458.
Lee JK, Kim HT, Cho SM et al: Characterization of 458 single nucleotide polymorphisms of disease candidate genes in the Korean population. J Hum Genet 2003; 48: 213–216.
Lee SG, Yoon Y, Hong S, Yoo J, Yang I, Song K : Allele frequency determination of publicly available cSNPs in the Korean population. Genet Med 2002; 4: 49S–51S.
Lazarus R, Klimecki WT, Palmer LJ et al: Single-nucleotide polymorphisms in the interleukin-10 gene: differences in frequencies, linkage disequilibrium patterns, and haplotypes in three United States ethnic groups. Genomics 2002; 80: 223–228.
Heil SG, Lievers KJ, Boers GH et al: Betaine-homocysteine methyltransferase (BHMT): genomic sequencing and relevance to hyperhomocysteinemia and vascular disease in humans. Mol Genet Metab 2000; 71: 511–519.
Lievers KJ, Kluijtmans LA, Heil SG et al: Cystathionine beta-synthase polymorphisms and hyperhomocysteinaemia: an association study. Eur J Hum Genet 2003; 11: 23–29.
Wang J, Hegele RA : Genomic basis of cystathioninuria (MIM 219500) revealed by multiple mutations in cystathionine gamma-lyase (CTH). Hum Genet 2003; 112: 404–408.
Devlin AM, Ling EH, Peerson JM et al: Glutamate carboxypeptidase II: a polymorphism associated with lower levels of serum folate and hyperhomocysteinemia. Hum Mol Genet 2000; 9: 2837–2844.
Brody LC, Conley M, Cox C et al: A polymorphism, R653Q, in the trifunctional enzyme methylenetetrahydrofolate dehydrogenase/methenyltetrahydrofolate cyclohydrolase/formyltetrahydrofolate synthetase is a maternal genetic risk factor for neural tube defects: report of the Birth Defects Research Group. Am J Hum Genet 2002; 71: 1207–1215.
Kahleova R, Palyzova D, Zvara K et al: Essential hypertension in adolescents: association with insulin resistance and with metabolism of homocysteine and vitamins. Am J Hypertens 2002; 15: 857–864.
van der Put NM, Gabreels F, Stevens EM et al: A second common mutation in the methylenetetrahydrofolate reductase gene: an additional risk factor for neural-tube defects? Am J Hum Genet 1998; 62: 1044–1051.
van Der Put NM, Blom HJ : Reply to Donnelly. Am J Hum Genet 2000; 66: 744–745.
Rady PL, Szucs S, Grady J et al: Genetic polymorphisms of methylenetetrahydrofolate reductase (MTHFR) and methionine synthase reductase (MTRR) in ethnic populations in Texas; a report of a novel MTHFR polymorphic site, G1793A. Am J Med Genet 2002; 107: 162–168.
Gaughan DJ, Kluijtmans LA, Barbaux S et al: The methionine synthase reductase (MTRR) A66G polymorphism is a novel genetic determinant of plasma homocysteine concentrations. Atherosclerosis 2001; 157: 451–456.
Chango A, Emery-Fillon N, de Courcy GP et al: A polymorphism (80G->A) in the reduced folate carrier gene and its associations with folate status and homocysteinemia. Mol Genet Metab 2000; 70: 310–315.
Heil SG, Van der Put NM, Waas ET, den Heijer M, Trijbels FJ, Blom HJ : Is mutated serine hydroxymethyltransferase (SHMT) involved in the etiology of neural tube defects? Mol Genet Metab 2001; 73: 164–172.
Li N, Seetharam S, Seetharam B : Genomic structure of human transcobalamin II: comparison to human intrinsic factor and transcobalamin I. Biochem Biophys Res Commun 1995; 208: 756–764.
Lievers KJ, Afman LA, Kluijtmans LA et al: Polymorphisms in the transcobalamin gene: association with plasma homocysteine in healthy individuals and vascular disease patients. Clin Chem 2002; 48: 1383–1389.
Li N, Sood GK, Seetharam S, Seetharam B : Polymorphism of human transcobalamin II: substitution of proline and/or glutamine residues by arginine. Biochim Biophys Acta 1994; 1219: 515–520.
Janosikova B, Pavlikova M, Kocmanova D et al: Genetic variants of homocysteine metabolizing enzymes and the risk of coronary artery disease. Mol Genet Metab 2003; 79: 167–175.
Zhao Z, Fu YX, Hewett-Emmett D, Boerwinkle E : Investigating single nucleotide polymorphism (SNP) density in the human genome and its implications for molecular evolution. Gene 2003; 312: 207–213.
Acknowledgements
We thank Mrs Jitka Sokolová for genotyping, Mrs Eva Richterová for excellent technical assistance and Dr Joseph D Terwilliger for critical reading of an earlier version of this manuscript. This study was supported by the Grant NM 6548-3 from Grant Agency of Ministry of Health of the Czech Republic. VK is supported by the Wellcome Trust International Senior Research Fellowship in Biomedical Science in Central Europe.
Additional information
Databases
http://www.ncbi.nlm.nih.gov/SNP/; http://www.ncbi.nlm.nih.gov/Genbank/.
Supplementary Information accompanies the paper on the European Journal of Human Genetics website (www.nature.com/ejhg)
Cite this article
Janošíková, B., Zavadáková, P. & Kožich, V. Single-nucleotide polymorphisms in genes relating to homocysteine metabolism: how applicable are public SNP databases to a typical European population?. Eur J Hum Genet 13, 86–95 (2005). https://doi.org/10.1038/sj.ejhg.5201282
DOI: https://doi.org/10.1038/sj.ejhg.5201282