Introduction

The multicountry effort to identify and catalog genetic similarities and differences in human beings has resulted in the creation of the public-domain genome database HapMap.1 This database comprises data on human genomic polymorphisms, primarily single nucleotide polymorphisms (SNPs), in populations of African (Yoruba of Ibadan, Nigeria; n=30 trios), Asian (Japanese of Tokyo, n=45 unrelated individuals; Han Chinese from the Beijing area, n=45 unrelated individuals) and European (n=30 CEPH trios) ancestry. Another public-domain database, dbSNP,2 also catalogs SNP variation in the human genome, based primarily on submissions by individual researchers. These public-domain databases have become valuable resources for designing human genetic studies and for comparing data and results generated by projects undertaken in different parts of the world. We are conducting a population genetics project, undertaken to study genomic diversity among Indian ethnic groups, that involves resequencing large regions of human chromosomes. In the course of this project, we were astounded to find that our data for a 6 kb region of chromosome 1 were completely discordant with those provided in the HapMap database. Our data pertain to 45 phenotypically normal individuals (approximately equal numbers of adult males and females) drawn from diverse ethnic groups of India inhabiting various geographical zones. We did not find such discrepancies in many other regions of chromosome 1 or on other chromosomes (data not shown). We carried out investigations to identify the causes of the discrepancies found for chromosome 1. Our results show that the HapMap data need to be used with some caution, especially because of the presence of insertions of mitochondrial genome sequences in the nuclear genome3 and of segmental duplications4 within the nuclear genome.

In the short arm of human chromosome 1 (nucleotide positions 604327–610167), there is an insertion5 of 5841 nucleotides (approximately 5.8 kb) from the human mitochondrial DNA (mtDNA). The human nuclear DNA contains a large number of such insertions, termed NUMTs.6 These insertion events have been estimated to have taken place at different times.7 The 5.8 kb NUMT in the nuclear DNA has 98.54% nucleotide identity with the corresponding segment of the mtDNA. Highly homologous partial copies of this 5.8 kb NUMT were also located in other parts of the nuclear genome. In the nuclear DNA, this NUMT is flanked on both sides by mammalian interspersed elements. We hypothesized that the data presented in the HapMap database for this region pertain to the mtDNA, not nuclear DNA.
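As a quick check on the figures above, the following minimal sketch computes the inclusive length of the quoted coordinate interval and the percent identity of two aligned sequences. The function names and the toy sequences are illustrative assumptions. The 98.54% figure in the text comes from the authors' own alignment, which is not reproduced here, and other conventions for counting gapped columns exist.

```python
def inclusive_length(start: int, end: int) -> int:
    """Number of bases in a 1-based, inclusive coordinate interval."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start + 1


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned columns carrying the same base.

    Assumes the two strings are already aligned (equal length, '-' for gaps).
    Columns containing a gap are counted as mismatches (an assumption).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("empty alignment")
    matches = sum(
        1
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


# Coordinates quoted in the text for the chromosome 1 NUMT.
print(inclusive_length(604327, 610167))  # 5841

# Toy alignment (hypothetical sequences, not real data): 9 of 10 columns match.
print(percent_identity("ACGTACGTAC", "ACGTACGTAT"))  # 90.0
```

The inclusive count explains why positions 604327–610167 span 5841 rather than 5840 nucleotides.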

Methods

To test this hypothesis, we carried out experiments using a strategy (Supplementary Information) that amplifies only the NUMT and not the corresponding homologous segment of mtDNA. Standard experimental approaches to DNA amplification, which use overlapping PCR primers designed to amplify 0.5–1 kb of DNA sequence, preferentially amplify the homologous segments present in mtDNA because a human cell contains a large number of copies of mtDNA. We used the ABI 3100 DNA sequencer to carry out bidirectional resequencing of overlapping PCR fragments. Variant sites were detected using PolyPhred (Ver. 4.0).8 The human nuclear reference sequence was taken from the NCBI human genome assembly (Build 35)9 and the mtDNA reference sequence from the Mitomap10 database.

Results and discussion

We detected 18 polymorphic or variant loci in this 5.8 kb nuclear region, of which 17 are single nucleotide changes and one is a dinucleotide insertion (Table 1). The insertion was detected by visual inspection of bidirectional DNA sequences and by the standard approach of offsetting and overlaying chromatograms obtained from resequencing in opposite directions. Only three of these loci are reported in dbSNP (Build 126). One possible reason for this may be that the 15 loci unreported in dbSNP are specific to Indian ethnic groups. It is also possible that this reflects incomplete submission of data to dbSNP. For this genomic region, 120 SNPs are listed in dbSNP, of which 29 were selected and genotyped in the International HapMap Project.11 We note that 71 of these 120 SNPs are reported in the Mitomap10 or mtDB12 databases as mtDNA variants at the corresponding homologous positions. Of the 18 variant sites detected by us, none coincides with any of the 29 sites chosen in the HapMap study. (We note that allele frequencies at the sites we detected vary across geographical regions within India. In fact, some of the sites are monomorphic in one or more geographical regions. The sample sizes within geographical regions are too small to draw valid inferences regarding the statistical significance of this variation, which in any case was not the focus of this study.) The relevant HapMap data are given in Table 2.
These data show that (a) if a site in the HapMap database is monomorphic, then irrespective of the nucleotide in the Reference Sequence, the nucleotide reported in HapMap is the nucleotide (major allele) that is present at the corresponding position in the mtDNA, and (b) if the site in the HapMap database is polymorphic, then the major and minor alleles at this site are exactly the same as those in mtDNA, irrespective of the nucleotide in the Reference Sequence. The mtDNA alleles were assessed from the Mitomap database,10 which is a compendium of polymorphisms and mutations of the human mtDNA, and from mtDNA resequencing of all the 45 individuals included in this study. These features clearly support our hypothesis that for this 6 kb region, the data reported in HapMap pertain to mtDNA, not nuclear DNA.
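Criteria (a) and (b) can be restated as a simple per-site rule. The sketch below is only an illustration of that logic: the `Site` record, its field names and the three example sites are hypothetical and are not taken from Table 2.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Site:
    """One HapMap site in the region (all fields are hypothetical)."""
    name: str
    nuclear_ref: str                # base in the nuclear Reference Sequence
    hapmap_alleles: FrozenSet[str]  # alleles reported in HapMap
    hapmap_major: str               # HapMap major allele
    mtdna_alleles: FrozenSet[str]   # alleles at the homologous mtDNA position
    mtdna_major: str                # mtDNA major allele


def matches_mtdna(site: Site) -> bool:
    """Apply criteria (a) and (b); the nuclear reference base is ignored."""
    if len(site.hapmap_alleles) == 1:
        # (a) monomorphic in HapMap: reported base equals the mtDNA major allele
        return site.hapmap_major == site.mtdna_major
    # (b) polymorphic in HapMap: same alleles and same major allele as mtDNA
    return (
        site.hapmap_alleles == site.mtdna_alleles
        and site.hapmap_major == site.mtdna_major
    )


# Hypothetical example sites, for illustration only.
sites = [
    Site("site_1", "C", frozenset("T"), "T", frozenset("T"), "T"),
    Site("site_2", "A", frozenset("AG"), "G", frozenset("AG"), "G"),
    Site("site_3", "A", frozenset("A"), "A", frozenset("G"), "G"),
]
for s in sites:
    print(s.name, matches_mtdna(s))
```

In this toy input, `site_1` satisfies the rule even though its nuclear reference base differs from the HapMap allele, which is the pattern the text describes.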

Table 1 Characteristics and allele frequencies of the 18 SNPs on the chromosome 1 region identified in the present study
Table 2 Characteristics of loci investigated in the International HapMap Project for a 5.8 kb region on chromosome 1

NUMTs are dispersed across all human chromosomes. It has been reported6 that 200 kb of the human nuclear genomic sequence shows significant levels of similarity to the human mtDNA. Further, segmental duplications cover 5.3% of the human genome.13 The duplicated regions have diverged to different degrees. Standard methods of PCR amplification and DNA resequencing can lead to a SNP being assigned to one of these regions when in fact it belongs to a homologous duplicated copy of that region. For these inserted (NUMTs) or duplicated regions, our results indicate the need for exercising caution while using the HapMap data.
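One practical way to exercise this caution is to flag database SNPs that fall inside known NUMT or segmental duplication intervals before using them. The following is a minimal sketch under stated assumptions: the only interval listed is the chromosome 1 NUMT with the coordinates quoted in the text, and the SNP names and positions are hypothetical.

```python
from typing import List, Tuple

# (chromosome, start, end), 1-based inclusive coordinates
Interval = Tuple[str, int, int]

# The chromosome 1 NUMT described in the text; a real list would hold
# all annotated NUMTs and segmental duplications (not provided here).
FLAGGED_REGIONS: List[Interval] = [("chr1", 604327, 610167)]


def in_flagged_region(chrom: str, pos: int, regions: List[Interval]) -> bool:
    """Return True if the position lies inside any flagged interval."""
    return any(c == chrom and s <= pos <= e for c, s, e in regions)


# Hypothetical SNPs: (name, chromosome, position).
snps = [
    ("snp_a", "chr1", 605000),
    ("snp_b", "chr1", 700000),
    ("snp_c", "chr2", 605000),
]
flagged = [
    name for name, chrom, pos in snps
    if in_flagged_region(chrom, pos, FLAGGED_REGIONS)
]
print(flagged)  # ['snp_a']
```

A linear scan is adequate for a handful of intervals; for genome-wide annotation an interval tree or sorted-interval search would be a more suitable choice.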