Introduction

Homocysteine is a thiol-containing amino acid that occupies a key position in the metabolism of one-carbon units and of sulfur compounds. Many clinical studies have revealed an association of elevated plasma homocysteine levels with an increased risk of cardiovascular disease1,2 or of other conditions.3 These studies, however, do not prove causality, as they merely demonstrate an epidemiological correlation. Homocysteine metabolism is in part determined by genetic variants, which are fixed at conception and do not typically change throughout life. Assuming Mendelian randomization, any observed association of these genetic factors with disease would suggest that the respective allelic variants are etiologically related to disease. Although association studies require that suitable genetic markers exist, a comprehensive list of such genetic variants in the field of homocysteine research is not available.

A polymorphism is defined as a heritable DNA change occurring in at least 1% of alleles; variants with a frequency higher than 10% are considered common polymorphisms. Single-nucleotide polymorphisms (SNPs) represent the most frequent type of polymorphism in the human population and may be useful in association studies, as they may themselves be functionally relevant and/or may be in linkage disequilibrium with other variants that may have an effect. The number of discovered SNPs has increased tremendously over the past few years. SNPs are present in different parts of the human genome; variations in coding regions, together with changes in regulatory regions, are believed to have the highest impact on phenotype.

Public SNP databases (dbSNPs) are a highly valuable resource of information about polymorphisms in candidate genes. At present, several dbSNPs exist in the public domain; their SNP content overlaps significantly, but the databases also complement one another.4 The dbSNP of the National Center for Biotechnology Information (NCBI) is one of the central repositories for newly discovered genomic and cDNA sequence variations, both single base changes and short deletions and insertions.5 In this dbSNP, almost six million unique SNPs had been deposited as of November 2003 (dbSNP build 117). The quality of database entries was evaluated in several studies employing positive predictive value (ie the probability that a putative SNP entry in a database is indeed a true polymorphism for a given population, with frequency of the rare allele higher than 1%) and sensitivity (ie the probability that all existing SNPs are deposited in the database).6,7,8,9 These studies analyzed samples of mixed ethnic origin,6,7,8,9 and to our knowledge, the effect of ethnicity on the predictive value of SNP databases has been evaluated in only a few studies.10,11,12 It is also important to note that the above-mentioned reports examined genes that were otherwise not a subject of intense research in clinical samples, which may have caused the rather low sensitivity of dbSNP in one of these studies.6

The aims of our study were (a) to collect all available information on SNPs in 24 genes relating to homocysteine metabolism (either directly in the methionine cycle or indirectly in the metabolism of vitamins) and (b) to assess the applicability of database entries to a typical Caucasian population from Central Europe. The applicability of the database was evaluated for a subset of 42 putative SNPs in seven genes of folate and homocysteine metabolism by calculating the positive predictive value after determining the population frequency by PCR-RFLP or ARMS-PCR in at least 100 control Czech chromosomes.

Methods

SNP data mining from database

SNPs in 24 genes relating to homocysteine metabolism were searched for on the NCBI web page as of January 2003 (build 110); detailed information on the analyzed genes is given in Table 1. The in silico search was based on gene name or symbol. The candidate SNPs were manually localized to the 5′UTR, introns, exons and 3′UTR of the particular gene using the GenBank reference sequence and the recommended numbering starting with the adenosine in the first ATG. The use of this numbering system led to discrepancies with some previously published SNPs (eg c.677C>T, c.1298A>C and c.1305C>T in the MTHFR gene). To collect recent data on individual SNPs for Table 2, we updated the frequencies using build 117 (November 2003) of the NCBI dbSNP.

Table 1 Genes included in this study
Table 2 Summary of all identified SNPs in coding regions

Literature searches

To collect recent data on SNPs and their frequencies, we also explored the literature, using Medline searches with specific gene names to identify the relevant studies published as of November 2003. In addition, data from conference proceedings were used for completing the list of known polymorphisms.

Genotyping and determination of frequency in the Czech population

To evaluate the positive predictive value of dbSNP, we selected all 42 SNPs available in build 110 of dbSNP (as of January 2003) that were localized in the coding regions of seven genes relating directly to homocysteine metabolism. The frequency of additional SNPs arising from dbSNP build 117 (as of November 2003) was not determined. The frequency of the 42 selected cSNPs was estimated experimentally in the Czech population using PCR-RFLP or ARMS-PCR with allele-specific primer pairs (see Table I in web supplement). Quality control of each batch of samples was ensured by (i) the presence of an additional internal restriction site, (ii) complete cleavage of the wild-type PCR product for SNPs that destroy a naturally occurring restriction site, (iii) including samples with known genotype or (iv) using a different PCR product containing the restriction site as an external control (for details see Table I in web supplement). Samples of genomic DNA from healthy controls aged between 18 and 65 years from a homogeneous Caucasian population in the Czech Republic were employed;28 at least 110 alleles (range 110–1194 alleles, median 300 alleles) were examined for the presence of each variant. The frequency of each SNP was determined by counting the number of chromosomes carrying and lacking the variant.
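The chromosome-counting step can be illustrated with a minimal Python sketch. It assumes an autosomal biallelic locus genotyped in diploid individuals; the function name and the genotype counts are hypothetical and are not taken from the study.

```python
def variant_allele_frequency(n_hom_ref, n_het, n_hom_var):
    """Return (variant allele frequency, number of chromosomes examined).

    Assumes an autosomal biallelic locus in diploid individuals, so each
    individual contributes two chromosomes (an assumption of this sketch).
    """
    n_chromosomes = 2 * (n_hom_ref + n_het + n_hom_var)
    if n_chromosomes == 0:
        raise ValueError("no genotypes supplied")
    # Heterozygotes carry one variant chromosome, variant homozygotes two.
    n_variant = n_het + 2 * n_hom_var
    return n_variant / n_chromosomes, n_chromosomes


# Hypothetical genotype counts for 150 controls (300 chromosomes).
frequency, chromosomes = variant_allele_frequency(120, 28, 2)
print(f"{frequency:.3f} in {chromosomes} chromosomes")
```

If the variant turns out to be the more common allele, the rare-allele frequency is simply one minus the returned value.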

Positive predictive value of database subset for Czech population

Positive predictive value was calculated in a subset of 42 SNPs as a ratio of the number of true polymorphisms (with frequency of the rare allele higher than 1%) to the total number of the putative SNPs that were found by in silico searches. Correlation was calculated using Prophet 5.0 software (BBN Systems and Technologies).
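The ratio defined above can be sketched in a few lines of Python. The function name and the example frequencies are hypothetical; only the 1% threshold comes from the definition given in the text.

```python
def positive_predictive_value(rare_allele_frequencies, threshold=0.01):
    """Fraction of putative SNPs whose rare-allele frequency exceeds threshold.

    `rare_allele_frequencies` holds one experimentally determined frequency
    per putative database entry (hypothetical input format for this sketch).
    """
    if not rare_allele_frequencies:
        raise ValueError("at least one putative SNP is required")
    confirmed = sum(1 for f in rare_allele_frequencies if f > threshold)
    return confirmed / len(rare_allele_frequencies)


# Hypothetical frequencies for five putative entries; three exceed 1%.
example = [0.32, 0.08, 0.015, 0.004, 0.0]
ppv = positive_predictive_value(example)
print(f"PPV = {ppv:.0%}")
```

Applied to the counts reported in the Results (25 confirmed polymorphisms among 42 putative cSNPs), the same ratio is 25/42, or about 60%.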

Results

In this study, we collected information on SNPs in 24 genes relating directly or indirectly to homocysteine metabolism. First, by in silico analysis, we scanned almost 1200 kbp of sequence in the NCBI database (build 110) and we detected 1353 putative SNP DNA variations, of which 85 were contained in the coding regions. The SNP density varied considerably for individual genes reaching a median of 1:683 for the genic regions and 1:412 for the coding regions (for details see Table II in web supplement). The median SNP densities in genes relevant to homocysteine metabolism are similar to the published estimates of 1:567 for the entire genome (dbSNP Summary build 117, as of November 2003).

As other researchers may use data on polymorphisms in genes relating to homocysteine metabolism in their genetic studies, we collected data on additional cSNPs, which were not subject to the experimental validation described below, and we also updated cSNP frequencies from all available sources as of November 2003 (including the literature and dbSNP build 117). Table 2 shows data on 112 putative or confirmed cSNPs; an experimentally determined frequency of the rare allele was available for 47 and 67 entries employing non-Caucasian/mixed and Caucasian samples, respectively.

To evaluate the applicability of the NCBI database to the Czech population, we selected a subset of 42 putative dbSNP entries in seven genes of folate and homocysteine metabolism for experimental validation (see Figure 1). As the first step in assessing the positive predictive value of this NCBI database subset for our population, we determined the frequency of all 42 cSNPs in at least 100 Czech control chromosomes using PCR-RFLP or ARMS-PCR (frequencies are given in Table 2). We then examined whether each of the putative database SNP entries met the definition criterion, that is, a frequency of the rare allele at the locus higher than 1%. As only 25 of the 42 putative cSNPs met this criterion while the remaining 17 variants were false positives, the positive predictive value of this NCBI SNP subset for the studied Czech population is 60%. Consequently, the median density of experimentally validated cSNPs (ie 1:950) is about half of that predicted from in silico searches (ie 1:412, for details see Table II in web supplement), which corresponds well to other studies.29 Interestingly, the false-positive cSNP entries were either rare variants in the NCBI database (eight entries with frequency <3% in mixed samples) or entries for which no frequency was available in the dbSNP (nine entries). These data suggest that dbSNP entries with low or missing frequency are more likely to be false positives in Caucasians.

Figure 1

Homocysteine metabolism. Genes relating to homocysteine metabolism that were selected for the experimental validation of cSNPs in this study are shown in shaded ellipses (for abbreviations of gene names see Table 1).

It is possible that the failure of the NCBI database to predict some cSNPs in the Czech population is a consequence of largely different SNP frequencies in the samples used to create the NCBI entries. To test this hypothesis, we examined the effect of ethnicity on frequency estimates. Of the 42 analyzed cSNPs, the NCBI database contained frequency information in non-Caucasian/mixed populations for 27 entries and in Caucasians for nine entries. In addition, the literature contained frequency data on 15 cSNPs for several European populations. The correlation of SNP frequencies between Czechs and other Caucasians (r²=0.90, P=0.0001, Figure 2b) was substantially stronger than that between Czech controls and the general NCBI data set (r²=0.73, P=0.0001, Figure 2a). Moreover, the frequencies of all 20 putative cSNPs for which data in Caucasians were available fell congruently below or above the 1% frequency threshold both in the Czech population and in other Caucasians. In summary, these data suggest that for genes relating to homocysteine metabolism, the cSNPs validated in one Caucasian population may be truly polymorphic in other Caucasians.

Figure 2

(a) Correlation of frequencies determined in the Czech population with frequencies found in the NCBI database regardless of ethnicity. (b) Correlation of frequencies determined in the Czech population with frequencies among Caucasians found in the NCBI database or the literature. Dashed curves define the 95% confidence intervals of the regression lines (f(x)=1.018x−0.01228 and f(x)=0.9773x−0.01479 for (a) and (b), respectively).
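The regression lines and squared correlation coefficients reported for Figure 2 were obtained with Prophet 5.0. The following Python sketch shows one standard way to compute the same quantities by ordinary least squares. It is an illustration only: the function name is hypothetical, and the paired frequencies are invented placeholders, not data from this study.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, with r squared."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # square of Pearson's r
    return slope, intercept, r_squared


# Hypothetical paired rare-allele frequencies (database vs. local population).
database_freq = [0.02, 0.10, 0.25, 0.33, 0.45]
local_freq = [0.01, 0.12, 0.22, 0.35, 0.41]
slope, intercept, r2 = linear_fit(database_freq, local_freq)
print(f"f(x) = {slope:.3f}x + {intercept:.4f}, r2 = {r2:.2f}")
```

The confidence bands and P values shown in the figure would require additional calculations that this sketch does not attempt.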

Discussion

To assess the applicability of public-domain dbSNPs to a European population, we collected and evaluated allele frequency data for 42 variant alleles relating to homocysteine metabolism. The positive predictive value of the NCBI data set for a typical Caucasian population was 60%, which is intermediate between the values reported by Cox et al6 and Reich et al.9 Cox et al found that 55% of polymorphisms detected in silico in coding sequence were confirmed experimentally, while Reich et al confirmed by independent resequencing over 88% of SNPs that were available in three different public and commercial databases. In summary, our study suggests that about two-thirds of NCBI dbSNP entries may be truly polymorphic in European populations, which corresponds very well to the conclusion of Marth et al: ‘if a researcher uses the publicly available candidate SNPs for a study in a population, there is only a 66–70% chance that the SNPs have appreciable minor allele frequency’.7

The applicability of dbSNPs to the study of genetic variants in specific populations may be limited by the presence of false-positive entries, which may constitute about one-third of database data.7 False positivity may arise from errors either at the step of entry generation or during validation of the SNPs in a given population sample. At the step of entry generation, false positives may be produced by technical problems such as sequencing errors, by errors during the computational data mining procedure,6 or by analysis of patient samples and misclassification of pathogenic mutations as SNPs.5 When validating the frequency of putative SNPs in a population sample, false positivity may originate from inadequate detection methods or insufficient sample size. In our study, genotyping errors were unlikely, as we employed quality control. Moreover, we screened at least 300 alleles for SNPs appearing as monomorphic, which gave us a power of 95.1% to classify them as truly false positive. Together, these data strongly suggest that these putative SNPs are indeed absent in the studied Czech population sample. Another, and the most likely, source of false positivity may be a difference in ethnicity between the samples from which the respective entry was generated and the samples of the studied population. Indeed, the comparison of cSNP frequency data between Czechs on the one hand and other Caucasians or non-Caucasian/mixed samples on the other shows that SNP frequencies are less well correlated between unrelated populations than between closely related populations. The larger distance between Czechs and the general NCBI data pool corresponds well to the observations of others,10 who showed that frequencies correlated more strongly between Koreans and other Asians than between Koreans and the general NCBI data set.
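The text does not state how the 95.1% power was derived. One simple calculation that reproduces this figure, offered here as an assumption rather than as the authors' documented method, is the probability of observing at least one variant chromosome among n independently sampled chromosomes when the true allele frequency sits at the 1% polymorphism threshold, that is, 1 − (1 − 0.01)^n. The function name below is hypothetical.

```python
def detection_power(n_alleles, allele_frequency=0.01):
    """Probability of seeing at least one variant among n_alleles chromosomes.

    Assumes independent sampling of chromosomes and a true variant frequency
    equal to `allele_frequency` (both are assumptions of this sketch).
    """
    if n_alleles < 0:
        raise ValueError("n_alleles must be non-negative")
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele_frequency must lie between 0 and 1")
    return 1.0 - (1.0 - allele_frequency) ** n_alleles


power_300 = detection_power(300)
print(f"{power_300:.1%}")  # 95.1%
```

Under the same assumptions, the function also shows how quickly power falls when fewer chromosomes are screened, which is why monomorphic variants were examined in a larger number of alleles than the study minimum.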

The databases may not contain all existing SNPs, which is reflected in another characteristic of a database, namely its sensitivity. The study of Cox et al6 suggested that the sensitivity of the database may be quite low, as the in silico search detected only 27% of the polymorphisms that were discovered experimentally in that study. Since we did not sequence the seven genes of interest in multiple control samples, we were unable to evaluate the sensitivity of dbSNP accurately. However, these genes have been systematically analyzed by other researchers in numerous clinical samples obtained from patients with disturbed homocysteine metabolism. Indeed, by searching the literature and by our own experimental work, we were able to find only three additional cSNPs that were lacking in the NCBI dbSNP (they were discovered experimentally by sequencing clinical samples). In summary, it is conceivable that most of the genetic variation in the coding regions of these seven genes has already been detected owing to the systematic analysis of these genes by the community of homocysteine researchers.

In our study, we collected structural and frequency data on polymorphisms in selected genes relating to sulfur amino-acid metabolism. This set of data suggests that about two-thirds of SNPs found in the database are indeed polymorphic in our population, and that the majority of existing cSNPs in genes relating to homocysteine metabolism have already been deposited in the NCBI database. However, the data from our study should be interpreted with caution, as the number of genes was quite small and the results may be confounded by the interest of the scientific community in these selected genes. Nevertheless, our study shows that the NCBI dbSNP is a valuable tool for selecting markers for genetic studies, and that experimental validation of cSNPs should be performed, especially if the frequency of the candidate polymorphism is low or lacking.