This page has been archived and is no longer updated
A second generation human haplotype map of over 3.1 million SNPs
The International HapMap Consortium*

We describe the Phase II HapMap, which characterizes over 3.1 million human single nucleotide polymorphisms (SNPs) genotyped in 270 individuals from four geographically diverse populations and includes 25–35% of common SNP variation in the populations surveyed. The map is estimated to capture untyped common variation with an average maximum r² of between 0.9 and 0.96 depending on population. We demonstrate that the current generation of commercial genome-wide genotyping products captures common Phase II SNPs with an average maximum r² of up to 0.8 in African and up to 0.95 in non-African populations, and that potential gains in power in association studies can be obtained through imputation. These data also reveal novel aspects of the structure of linkage disequilibrium. We show that 10–30% of pairs of individuals within a population share at least one region of extended genetic identity arising from recent ancestry and that up to 1% of all common variants are untaggable, primarily because they lie within recombination hotspots. We show that recombination rates vary systematically around genes and between genes of different function. Finally, we demonstrate increased differentiation at non-synonymous, compared to synonymous, SNPs, resulting from systematic differences in the strength or efficacy of natural selection between populations.

Advances made possible by the Phase I haplotype map

The International HapMap Project was launched in 2002 with the aim of providing a public resource to accelerate medical genetic research.
The objective was to genotype at least one common SNP every 5 kilobases (kb) across the euchromatic portion of the genome in 270 individuals from four geographically diverse populations¹,²: 30 mother–father–adult child trios from the Yoruba in Ibadan, Nigeria (abbreviated YRI); 30 trios of northern and western European ancestry living in Utah from the Centre d'Etude du Polymorphisme Humain (CEPH) collection (CEU); 45 unrelated Han Chinese individuals in Beijing, China (CHB); and 45 unrelated Japanese individuals in Tokyo, Japan (JPT). The YRI samples and the CEU samples each form an analysis panel; the CHB and JPT samples together form an analysis panel. Approximately 1.3 million SNPs were genotyped in Phase I of the project, and a description of this resource was published in 2005 (ref. 3).

The initial HapMap Project data had a central role in the development of methods for the design and analysis of genome-wide association studies. These advances, alongside the release of commercial platforms for performing economically viable genome-wide genotyping, have led to a new phase in human medical genetics. Already, large-scale studies have identified novel loci involved in multiple complex diseases⁴,⁵. In addition, the HapMap data have led to novel insights into the distribution and causes of recombination hotspots³,⁶, the prevalence of structural variation⁷,⁸ and the identity of genes that have experienced recent adaptive evolution³,⁹. Because the HapMap cell lines are publicly available, many groups have been able to integrate their own experimental data with the genome-wide SNP data to gain new insight into copy-number variation¹⁰, the relationship between classical human leukocyte antigen (HLA) types and SNP variation¹¹, and heritable influences on gene expression¹²–¹⁴.
The ability to combine genome-wide data on such diverse aspects of genetic variation with molecular phenotypes collected in the same samples provides a powerful framework to study the connection of DNA sequence to function.

In Phase II of the HapMap Project, a further 2.1 million SNPs were successfully genotyped on the same individuals. The resulting HapMap has an SNP density of approximately one per kilobase and is estimated to contain approximately 25–35% of all the 9–10 million common SNPs (minor allele frequency (MAF) ≥ 0.05) in the assembled human genome (that is, excluding gaps in the reference sequence alignment; see Supplementary Text 1), although this number shows extensive local variation. This paper describes the Phase II resource, its implications for genome-wide association studies and additional insights into the fine-scale structure of linkage disequilibrium, recombination and natural selection.

Construction of the Phase II HapMap

Most of the additional genotype data for the Phase II HapMap were obtained using the Perlegen amplicon-based platform¹⁵. Briefly, this platform uses custom oligonucleotide arrays to type SNPs in DNA segmentally amplified via long-range polymerase chain reaction (PCR). Genotyping was attempted at 4,373,926 distinct SNPs, which corresponds, with exceptions (see Methods), to nearly all SNPs in dbSNP release 122 for which an assay could be designed. Additional submissions were included from the Affymetrix GeneChip Mapping Array 500K set, the Illumina HumanHap100 and HumanHap300 SNP assays, a set of ~11,000 non-synonymous SNPs genotyped by Affymetrix (ParAllele) and a set of ~4,500 SNPs within the extended major histocompatibility complex (MHC)¹¹.
Genotype submissions were subjected to the same quality control (QC) filters as described previously (see Methods) and mapped to NCBI build 35 (University of California at Santa Cruz (UCSC) hg17) of the human genome. The re-mapping of SNPs from Phase I of the project identified 21,177 SNPs that had an ambiguous position or some other feature indicative of low reliability; these are not included in the filtered Phase II data release. All genotype data are available from the HapMap Data Coordination Center (http://www.hapmap.org) and dbSNP (http://www.ncbi.nlm.nih.gov/SNP); analyses described in this paper refer to release 21a. Three data sets are available: 'redundant unfiltered' contains all genotype submissions, 'redundant filtered' contains all submissions that pass QC, and 'non-redundant filtered' contains a single QC+ submission for each SNP in each analysis panel.

*Lists of participants and affiliations appear at the end of the paper.
Vol 449 | 18 October 2007 | doi:10.1038/nature06258
Nature ©2007 Publishing Group

The QC filters remove SNPs showing gross errors. However, it is also important to understand the magnitude and structure of more subtle genotyping errors among SNPs that pass QC. We therefore carried out a series of analyses to assess the influence of the long-range PCR amplicon structure on genotyping error, the concordance rates between genotype calls from different genotyping platforms and between those platforms and re-sequencing assays, as well as the rates of false monomorphism and mis-mapping of SNPs (see Supplementary Text 2, Supplementary Figs 1–3 and Supplementary Tables 1–4). We estimate that the average per genotype accuracy is at least 99.5%. However, there are higher rates of missing data and genotype discrepancies at non-reference alleles, with some clustering of errors resulting from the amplicon design and a few incorrectly mapped SNPs.
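The failure categories reported in Table 1 (more than 20% missing genotypes, more than one inconsistent duplicate, more than one mendelian error, a Hardy–Weinberg P-value below 0.001) can be sketched as a per-SNP filter. The Python below is a minimal illustration under stated assumptions, not the consortium's pipeline: the function names and input counts are hypothetical, and the Hardy–Weinberg check uses a simple chi-square approximation with one degree of freedom, which may differ from the test described in the Methods.

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Approximate Hardy-Weinberg P-value (chi-square, 1 d.f.) from genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(n_aa, n_ab, n_bb, n_missing,
              n_mendel_errors=0, n_dup_discordant=0):
    """Apply the four thresholds named in Table 1 to one SNP in one panel."""
    total = n_aa + n_ab + n_bb + n_missing
    if total == 0:
        return False
    if n_missing / total > 0.20:        # >20% missing
        return False
    if n_dup_discordant > 1:            # >1 duplicate inconsistent
        return False
    if n_mendel_errors > 1:             # >1 mendelian error (trio panels only)
        return False
    if hwe_chi2_pvalue(n_aa, n_ab, n_bb) < 0.001:
        return False
    return True
```

For example, `passes_qc(25, 50, 25, n_missing=0)` passes every check, whereas a SNP with no heterozygotes among 100 calls, `passes_qc(50, 0, 50, n_missing=0)`, fails the Hardy–Weinberg check.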
Table 1 shows the numbers of SNPs attempted and converted to QC+ SNPs in each analysis panel (Supplementary Table 5 shows a breakdown by each major submission). Haplotypes and missing data were estimated for each analysis panel separately using both trio information and statistical methods based on the coalescent model (see Methods). To enable cross-population comparisons, a consensus data set was created consisting of 3,107,620 SNPs that were QC+ in all analysis panels and polymorphic in at least one analysis panel. The equivalent figure from Phase I was 931,340 SNPs. Unless stated otherwise, all analyses have been carried out on the consensus data set. An additional set of haplotypes was created for those SNPs in the consensus where a putative ancestral state could be assigned by comparison of the human alleles to the orthologous position in the chimpanzee and rhesus macaque genomes.

The variation in SNP density within the Phase II HapMap is shown in Fig. 1. On average there are 1.14 genotyped polymorphic SNPs per kilobase (average spacing is 875 base pairs (bp)) and 98.6% of the assembled genome is within 5 kb of the nearest polymorphic SNP. Still, there is heterogeneity in genotyped SNP density at both broad (Fig. 1a) and fine (Fig. 1b) scales. Furthermore, there are systematic changes in genotyped SNP density around genomic features including genes (Fig. 1c).

The Phase II HapMap differs from the Phase I HapMap not only in SNP spacing, but also in minor allele frequency distribution and patterns of linkage disequilibrium (Supplementary Fig. 4). Because the criteria for choosing additional SNPs did not include consideration of SNP spacing or preferential selection for high MAF, the SNPs added in Phase II are, on average, more clustered and have lower MAF than the Phase I SNPs.
Because MAF predictably influences the distribution of linkage disequilibrium statistics, the average r² at a given physical distance is typically lower in Phase II than in Phase I; conversely, the |D′| statistic is typically higher (data not shown). One notable consequence is that the Phase II HapMap includes a better representation of rare variation than the Phase I HapMap.

The increased resolution provided by Phase II of the project is illustrated in Fig. 2. Broadly, an additional SNP added to a region shows one of three patterns. First, it may be very similar in distribution to SNPs present in Phase I. Second, it may provide detailed resolution of haplotype structure (for example, a group of chromosomes with identical local haplotypes in Phase I can be shown in Phase II to carry multiple related haplotypes). Third, the novel SNP (or group of added SNPs) may reveal previously missed recombinant haplotypes. The extent to which each type of event occurs varies among populations and chromosomal regions. The greatest gains in resolution, in terms of identifying new recombinant haplotypes and haplotype groupings, occur in YRI. Consequently, the Phase II HapMap provides increased resolution in the estimated fine-scale genetic map and improved power to detect and localize recombination hotspots (Fig. 2b).

Table 1 | Summary of Phase II HapMap data (release 21)

| Phase | SNP categories | YRI | CEU | CHB+JPT |
|---|---|---|---|---|
| I | Assays submitted | 1,304,199 | 1,344,616 | 1,306,125 |
| | Passed QC | 1,177,312 (90%) | 1,217,902 (91%) | 1,187,800 (91%) |
| | Did not pass QC | 126,887 (10%) | 126,714 (9%) | 118,325 (9%) |
| | >20% missing | 82,463 (65%) | 95,684 (76%) | 78,323 (66%) |
| | >1 duplicate inconsistent | 6,049 (5%) | 5,126 (4%) | 9,242 (8%) |
| | >1 mendelian error | 18,916 (15%) | 11,310 (9%) | N/A |
| | <0.001 Hardy–Weinberg P-value | 10,265 (8%) | 8,922 (7%) | 13,722 (12%) |
| | Other failures | 19,345 (15%) | 13,858 (11%) | 20,674 (17%) |
| II | Assays submitted | 5,044,989 | 5,044,996 | 5,043,775 |
| | Passed QC | 3,150,433 (62%) | 3,204,709 (64%) | 3,244,897 (64%) |
| | Did not pass QC | 1,894,556 (38%) | 1,840,287 (36%) | 1,798,878 (36%) |
| | >20% missing | 1,419,000 (75%) | 1,398,166 (76%) | 1,403,543 (78%) |
| | >1 duplicate inconsistent | 0 (0%) | 0 (0%) | 6,617 (0%) |
| | >1 mendelian error | 172,339 (9%) | 127,923 (7%) | N/A |
| | <0.001 Hardy–Weinberg P-value | 96,231 (5%) | 82,268 (4%) | 108,880 (6%) |
| | Other failures | 334,511 (18%) | 337,906 (18%) | 340,370 (19%) |
| Overall | Assays submitted | 6,349,188 | 6,389,612 | 6,349,900 |
| | Passed QC | 4,327,745 (68%) | 4,422,611 (69%) | 4,432,697 (70%) |
| | Did not pass QC | 2,021,443 (32%) | 1,967,001 (31%) | 1,917,203 (30%) |
| | >20% missing | 1,501,463 (74%) | 1,493,850 (76%) | 1,481,866 (77%) |
| | >1 duplicate inconsistent | 6,049 (0%) | 5,126 (0%) | 15,859 (1%) |
| | >1 mendelian error | 191,255 (9%) | 139,233 (7%) | N/A |
| | <0.001 Hardy–Weinberg P-value | 106,496 (5%) | 91,190 (5%) | 122,602 (6%) |
| | Other failures | 353,856 (18%) | 351,764 (18%) | 361,044 (19%) |
| | Non-redundant (unique) SNPs | 3,796,934 | 3,868,157 | 3,890,416 |
| | Monomorphic | 861,299 (23%) | 1,246,183 (32%) | 1,410,152 (36%) |
| | Polymorphic | 2,935,635 (77%) | 2,621,974 (68%) | 2,480,264 (64%) |

| SNP categories | All analysis panels |
|---|---|
| Unique QC-passed SNPs | 4,000,107 |
| Passed in one analysis panel | 88,140 (2%) |
| Passed in two analysis panels | 268,534 (7%) |
| Passed in three analysis panels (QC+3) | 3,643,433 (91%) |
| QC+3 and monomorphic across three analysis panels | 535,813 |
| QC+3 and polymorphic in at least one analysis panel | 3,107,620 |
| QC+3 and polymorphic in all three analysis panels | 2,006,352 |
| QC+3 and MAF ≥ 0.05 in at least one of three analysis panels | 2,819,322 |

The use of the Phase II HapMap in association studies

The increased SNP density of the Phase II HapMap has already been extensively exploited in genome-wide studies of disease association. In this section, we quantify the gain in resolution and outline how the HapMap data can be used to improve the power of association studies.

Improved coverage of common variation. We previously predicted that the vast majority of common SNPs would be correlated to Phase II HapMap SNPs by extrapolation from the ten HapMap ENCODE regions³.
Using the actual Phase II marker spacing and frequency distributions (Table 2), we repeated the simulations and estimate that Phase II HapMap marker sets capture the overwhelming majority of all common variants at high r². For common variants (MAF ≥ 0.05) the mean maximum r² of any SNP to a typed one is 0.90 in YRI, 0.96 in CEU and 0.95 in CHB+JPT. The impact of the increased density of the Phase II HapMap is most notable in YRI (in the Phase I HapMap the mean maximum r² was 0.67). Similar results are found if a threshold of r² ≥ 0.8 is used to determine whether an SNP is captured (Table 2). As expected, very common SNPs with MAF > 0.25 are captured extremely well (mean maximum r² of 0.93 in YRI to 0.97 in CEU), whereas rarer SNPs with MAF < 0.05 are less well covered (mean maximum r² of 0.74 in CHB+JPT to 0.76 in YRI). The latter figure is probably an overestimate because it is based on lower frequency SNPs discovered via re-sequencing 48 HapMap individuals, and does not include a much larger number of very rare SNPs.

Figure 1 | SNP density in the Phase II HapMap. a, SNP density across the genome. Colours indicate the number of polymorphic SNPs per kb in the consensus data set. Gaps in the assembly are shown as white. b, Example of the fine-scale structure of SNP density for a 100-kb region on chromosome 17 showing Perlegen amplicons (black bars), polymorphic Phase I SNPs in the consensus data set (red triangles) and polymorphic Phase II SNPs in the consensus data set (blue triangles). Note the relatively even spacing of Phase I SNPs. c, The distribution of polymorphic SNPs in the consensus Phase II HapMap data (blue line and left-hand axis) around coding regions. Also shown is the density of SNPs in dbSNP release 125 around genes (red line and right-hand axis). Values were calculated separately 5′ from the coding start site (the left dotted line) and 3′ from the coding end site (right dotted line) and were joined at the median midpoint position of the coding unit (central dotted line).

Figure 2 | Haplotype structure and recombination rate estimates from the Phase II HapMap. a, Haplotypes from YRI in a 100 kb region around the β-globin (HBB) gene. SNPs typed in Phase I are shown in dark blue. Additional SNPs in the Phase II HapMap are shown in light blue. Only SNPs for which the derived allele can be unambiguously identified by parsimony (by comparison with an outgroup sequence) are shown (89% of SNPs in the region); the derived allele is shown in colour. b, Recombination rates (lines) and the location of hotspots (horizontal blue bars) estimated for the same region from the Phase I (dark blue) and Phase II HapMap (light blue) data. Also shown are the location of genes within the region (grey bars) and the location of the experimentally verified recombination hotspot⁵⁷,⁵⁸ at the 5′ end of the HBB gene (black bar).

We also assessed the increase in coverage provided by using two-SNP haplotypes as proxies for SNPs that are poorly captured by single SNPs¹⁶ (Table 2). These two-SNP haplotypes lead to a modest increase in mean maximum r² of 0.01 to 0.03 across all allele frequencies. However, in some regions, particularly where marker density is low, gains from multi-marker and imputation approaches in practical situations can be substantial (see below).
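The coverage figures quoted in this section rest on two quantities: the pairwise r² between two SNPs, and, for each untyped SNP, the maximum r² to any typed SNP. The sketch below shows one way to compute both, assuming phased haplotypes coded as 0/1 alleles. The function names are illustrative and not taken from the paper, and the simulations behind Table 2 are more involved than this.

```python
def r_squared(hap_a, hap_b):
    """Pairwise r^2 between two biallelic SNPs given phased 0/1 alleles,
    one entry per chromosome, in the same chromosome order."""
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and equal length")
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # at least one SNP is monomorphic
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


def coverage(untyped, typed, threshold=0.8):
    """Return (mean maximum r^2, fraction of untyped SNPs with max r^2 >= threshold)."""
    best = [max((r_squared(u, t) for t in typed), default=0.0) for u in untyped]
    mean_best = sum(best) / len(best)
    captured = sum(1 for b in best if b >= threshold) / len(best)
    return mean_best, captured
```

With `threshold=0.8`, the second return value corresponds to the "r² ≥ 0.8 (%)" style of summary, and the first to the "mean maximum r²" summary.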
Currently, the Phase II HapMap provides the most complete available resource for selecting tag SNPs genome-wide. Using a simple pairwise tagging approach, we find that 1.09 million SNPs are required to capture all common Phase II SNPs with r² ≥ 0.8 in YRI, with slightly more than 500,000 required in CEU and CHB+JPT (Table 3). These numbers are approximately twice those required to capture SNPs in the Phase I HapMap (which has one-third as many SNPs). The number of SNPs required to achieve perfect tagging (r² = 1.0) in each analysis panel is almost double that required to achieve the r² ≥ 0.8 threshold. It becomes increasingly expensive to improve the coverage afforded by tags from the Phase I and, now, the Phase II HapMap, because additional tag SNPs are unlikely to capture large groups of additional SNPs.

Phase II HapMap and genome-wide association studies. Although the efficient choice of tag SNPs is one use of the Phase II HapMap, for most disease studies the tag SNPs genotyped will be primarily determined by the choice of a commercial platform for the experiment¹⁷,¹⁸. Using Phase II data, we estimated the coverage of several available products on which genome-wide association studies are already underway (Table 4). Similar to earlier estimates¹⁷,¹⁸, these products typically perform well in CEU and CHB+JPT, and some also perform well in YRI. For example, arrays of approximately 500,000 SNPs capture 68–88% (depending on selection method) of all HapMap Phase II variation with r² ≥ 0.8 in CEU. SNPs that are not included in the Phase II HapMap will be covered more poorly because most genotyping products were designed using HapMap data.

HapMap data have several additional roles in the analysis of disease-association studies using fixed marker sets. For example, the high-quality haplotype information within the Phase II HapMap can be used to aid the phasing of genotype data from new samples because additional haplotypes are likely to be locally very similar to at least one haplotype in the Phase II data.
By a similar argument, missing genotypes can potentially be inferred through comparison to the Phase II haplotypes. Genotypes may be missing either because of genotyping failure or because the SNP was not assayed within the experiment. Therefore, the HapMap haplotypes provide a way of in silico genotyping Phase II SNPs that were not included in the experiment.

Although there is no clear consensus yet about the role of SNP imputation in the analysis of genome-wide association studies, high imputation accuracy can be achieved using model-based methods¹⁹–²³ and can lead to an increase in power²³,²⁴. To illustrate the possibilities, in the 500-kb HapMap ENCODE region on 8q24.11 (Supplementary Fig. 5) we evaluated imputation of Phase II SNPs from the Affymetrix GeneChip 500K array. To do this, we used a leave-one-out procedure to assess the accuracy of genotype prediction in the YRI. For SNPs with MAF ≥ 0.2, the average maximum r² to a typed SNP in the region is 0.59 compared to an average genotype prediction r² of 0.86. Furthermore, whereas 44% of such SNPs in the region have no single-marker proxy with r² ≥ 0.5, fewer than 6% of the SNPs have a genotype imputation accuracy of r² < 0.5, establishing that accurate imputation can be achieved even in the population where linkage disequilibrium is the weakest.

Table 2 | Estimated coverage of the Phase II HapMap in the ten HapMap ENCODE regions

| Panel | MAF bin | Phase I HapMap³: r² ≥ 0.8 (%) | Phase I HapMap³: mean maximum r² | Phase II pairwise linkage disequilibrium: r² ≥ 0.8 (%) | Phase II pairwise linkage disequilibrium: mean maximum r² | Phase II additional 2-SNP tests: r² ≥ 0.8 (%) | Phase II additional 2-SNP tests: mean maximum r² |
|---|---|---|---|---|---|---|---|
| YRI | ≥0.05 | 45 | 0.67 | 82 | 0.90 | 87 | 0.93 |
| | <0.05 | | | 61 | 0.76 | 62 | 0.78 |
| | 0.05–0.10 | | | 81 | 0.89 | 81 | 0.89 |
| | 0.10–0.25 | | | 90 | 0.94 | 90 | 0.95 |
| | 0.25–0.50 | | | 87 | 0.93 | 92 | 0.96 |
| CEU | ≥0.05 | 74 | 0.85 | 93 | 0.96 | 95 | 0.97 |
| | <0.05 | | | 70 | 0.79 | 72 | 0.81 |
| | 0.05–0.10 | | | 87 | 0.92 | 88 | 0.93 |
| | 0.10–0.25 | | | 94 | 0.96 | 95 | 0.97 |
| | 0.25–0.50 | | | 95 | 0.97 | 97 | 0.98 |
| CHB+JPT | ≥0.05 | 72 | 0.83 | 92 | 0.95 | 95 | 0.97 |
| | <0.05 | | | 65 | 0.74 | 65 | 0.74 |
| | 0.05–0.10 | | | 81 | 0.89 | 82 | 0.89 |
| | 0.10–0.25 | | | 90 | 0.94 | 90 | 0.95 |
| | 0.25–0.50 | | | 94 | 0.96 | 97 | 0.98 |

2-SNP tests, linkage disequilibrium to haplotypes formed from two nearby SNPs.

Table 3 | Number of tag SNPs required to capture common (MAF ≥ 0.05) Phase II SNPs

| Threshold | YRI | CEU | CHB+JPT |
|---|---|---|---|
| r² ≥ 0.5 | 627,458 | 290,969 | 277,831 |
| r² ≥ 0.8 | 1,093,422 | 552,853 | 520,111 |
| r² = 1.0 | 1,616,739 | 1,024,665 | 1,078,959 |

Table 4 | Estimated coverage of commercially available fixed marker arrays

| Platform* | YRI: r² ≥ 0.8 (%) | YRI: mean maximum r² | CEU: r² ≥ 0.8 (%) | CEU: mean maximum r² | CHB+JPT: r² ≥ 0.8 (%) | CHB+JPT: mean maximum r² |
|---|---|---|---|---|---|---|
| Affymetrix GeneChip 500K | 46 | 0.66 | 68 | 0.81 | 67 | 0.80 |
| Affymetrix SNP Array 6.0 | 66 | 0.80 | 82 | 0.90 | 81 | 0.89 |
| Illumina HumanHap300 | 33 | 0.56 | 77 | 0.86 | 63 | 0.78 |
| Illumina HumanHap550 | 55 | 0.73 | 88 | 0.92 | 83 | 0.89 |
| Illumina HumanHap650Y | 66 | 0.80 | 89 | 0.93 | 84 | 0.90 |
| Perlegen 600K | 47 | 0.68 | 92 | 0.94 | 84 | 0.90 |

*Assuming all SNPs on the product are informative and pass QC; in practice these numbers are overestimates.

New insights into linkage disequilibrium structure

The paradigm underlying association studies is that linkage disequilibrium can be used to capture associations between markers and nearby untyped SNPs. However, the Phase II HapMap has revealed several properties of linkage disequilibrium that illustrate the full complexity of empirical patterns of genetic variation. Two striking features are the long-range similarity among haplotypes, and SNPs that show almost no linkage disequilibrium with any other SNP.

The extent of recent common ancestry and segmental sharing.
A simplified view of linkage disequilibrium is that genetic variation is organized in relatively short stretches of strong linkage disequilibrium (haplotype blocks), each containing only a few common haplotypes and separated by recombination hotspots across which little association remains²⁵. Although this view has heuristic value, if chromosomes share a recent common ancestor then similarity between chromosomes can extend over considerable genetic distance and span multiple recombination hotspots²⁶. The extent of such recent ancestry in the four populations surveyed here has not been characterized previously. Therefore we identified stretches of identity between pairs of chromosomes, both within and across individuals, reflecting autozygosity and identity-by-descent (IBD) (Fig. 3a). After first checking for stratification within each analysis panel (see Supplementary Text 3; none was found for YRI, CEU and JPT, and only small stratification was found for CHB), we calculated genome-wide probabilities of sharing 0, 1 or 2 chromosomes identical by descent for each pair of individuals (see Supplementary Text 4). In addition to identifying a few close relationships (as reported in HapMap Phase I³), we estimate that, on average, any two individuals from the same population share approximately 0.5% of their genome through recent IBD (Table 5). Using a hidden Markov model approach²⁷ (see Supplementary Text 5), we searched for such shared segments over 1-megabase (Mb) long and containing at least 50 SNPs, after first pruning the list of SNPs to remove local linkage disequilibrium. We find that 10–30% of pairs in each analysis panel share regions of extended identity resulting from sharing a common ancestor within 10–100 generations. These regions typically span hundreds of SNPs and can extend over tens of megabases (Table 5).

Similarly, extended stretches of homozygosity are indicative of recent inbreeding within populations²⁸,²⁹.
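As a rough illustration of what an extended stretch of homozygosity means in genotype data, the sketch below scans one individual's genotypes along a chromosome and reports uninterrupted homozygous runs. It is a deliberately naive scan, not the method used for the analyses reported here (see the Supplementary Text for those): it does not model genotyping error or deletions, the function name is hypothetical, and the default thresholds simply borrow the 1 Mb and 50 SNP figures quoted above for shared segments.

```python
def _long_enough(start, last, count, min_length_bp, min_snps):
    """True if the current run meets both the length and SNP-count thresholds."""
    return (start is not None
            and last - start >= min_length_bp
            and count >= min_snps)


def homozygous_runs(positions, genotypes,
                    min_length_bp=1_000_000, min_snps=50):
    """Find runs of consecutive homozygous genotypes.

    positions: SNP coordinates in bp, sorted ascending.
    genotypes: copies of one allele per SNP (0, 1 or 2), or None if missing.
    Returns a list of (start_bp, end_bp, n_snps) tuples.
    """
    runs = []
    start = last = None
    count = 0
    for pos, g in zip(positions, genotypes):
        if g is None:
            continue  # missing calls neither extend nor break a run
        if g == 1:    # a heterozygous call ends the current run
            if _long_enough(start, last, count, min_length_bp, min_snps):
                runs.append((start, last, count))
            start = last = None
            count = 0
        else:
            if start is None:
                start = pos
            last = pos
            count += 1
    if _long_enough(start, last, count, min_length_bp, min_snps):
        runs.append((start, last, count))
    return runs
```

In practice a single miscalled heterozygote would split a true run, which is one reason production tools allow a small number of heterozygous or missing calls per window.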
Although short runs of homozygosity are commonplace, covering up to one-third of the genome and showing population differences reflective of ancient linkage disequilibrium patterns (Table 5 and Fig. 3b), very long homozygous runs exist that are clearly distinct from this process. Including two JPT individuals who have unusually high levels of homozygosity (NA18987 and NA18992) and one CEU individual (NA12874), we identified 79 homozygous regions over 3 Mb in 51 individuals, with many segments extending over 10 Mb (Supplementary Tables 7 and 8). Segments intersecting with suspected deletions were first removed from the analysis (Supplementary Text 6).

In studies of rare mendelian diseases, the extended haplotype sharing surrounding recent mutations, usually with a frequency of much less than 1%, has been exploited to great advantage through homozygosity mapping³⁰,³¹ and haplotype sharing³² methods. In studies of common disease, extended haplotype sharing among patients potentially offers a route for identifying rare variants (MAF in the range of 1–5%) of high penetrance³³,³⁴, which tend to be poorly captured through single-marker association with genome-wide arrays. To illustrate the idea, we identified SNPs where only two copies of the minor allele are present (referred to as '2-SNPs'), which have minor allele frequencies of 1–2%. We find that these are enriched approximately sevenfold (Table 5) among regions of IBD identified by the hidden Markov model approach. Notably, identification of IBD regions can be performed with the same genome-wide SNP data being collected in large-scale association studies, making haplotype-sharing approaches an attractive and complementary analysis to standard SNP association tests, with the potential to identify rare variants associated with complex disease.

Figure 3 | The extent of recent co-ancestry among HapMap individuals. a, Three pairs of individuals with varying levels of identity-by-descent (IBD) sharing illustrate the continuum between very close and very distant relatedness and its relation to segmental sharing. The three pairs are: high sharing (NA19130 and NA19192 from YRI; previously identified as second-degree relatives³), moderate sharing (NA06994 and NA12892 from CEU) and low sharing (NA12006 and NA12155 from CEU). Along each chromosome, the probability of sharing at least one chromosome IBD is plotted, based on the HMM method described in Supplementary Text 5. Red sections indicate regions called as segments: in general, the proportion of the genome in segments is similar to each pair's estimated global relatedness. b, The extent of homozygosity on each chromosome for each individual in each analysis panel. Excludes segments < 106 kb and chromosome X in males. Asterisk, NA12874, length = 107 Mb. YRI, green; CEU, orange; CHB, blue; JPT, magenta.
[Panel a annotations: NA19130 and NA19192 (YRI), P(IBD1) = 0.48, 52 segments, 1,330.8 Mb; NA06994 and NA12892 (CEU), P(IBD1) = 0.06, 12 segments, 152.1 Mb; NA12006 and NA12155 (CEU), P(IBD1) = 0.01, 1 segment, 7.6 Mb. Panel b axes: chromosome (1–22, X) against total length (Mb).]

Table 5 | Relatedness, extended segmental sharing and homozygosity

| Property | YRI | CEU | CHB | JPT |
|---|---|---|---|---|
| Number of pairs included | 1,767 | 1,708 | 990 | 861 |
| Mean identity by state (IBS) (%) | 81.9 | 83.7 | 85.0 | 85.1 |
| Mean identity by descent (IBD) (%) | 0.04 | 0.34 | 0.36 | 0.42 |
| Number of pairs with >1% IBD (%) | 8.8 | 20.4 | 21.1 | 29.7 |
| Number of pairs with one or more segment (%) | 195 (11.0) | 350 (20.5) | 135 (13.6) | 216 (25.1) |
| Total number of segments | 250 | 427 | 146 | 273 |
| Total distance spanned (Mb) | 1,416 | 2,336 | 704 | 1,301 |
| Mean segment length (Mb) | 5.7 | 5.5 | 4.8 | 4.8 |
| Maximum segment length (Mb) | 51.7 | 56.2 | 15.0 | 25.3 |
| Maximum segment length (Mb) (including close relatives) | 141.4 | 128.5 | N/A | N/A |
| Total number of 2-SNPs | 6,219 | 9,220 | 8,174 | 8,750 |
| Number of 2-SNPs in segments | 109 | 162 | 116 | 132 |
| 2-SNP fold increase | 6.7 | 7.3 | 7.6 | 7.0 |
| Number of homozygous segments (×10³)* | 0.9 | 2.2 | 2.6 | 2.6 |
| SNPs in homozygous segments (×10⁵) | 1.6 | 4.2 | 5.3 | 5.4 |
| Total length of homozygous segments (Mb) | 160 | 410 | 510 | 520 |

2-SNP, SNPs where only two copies of the minor allele are present.
*Homozygous segments > 106 kb.

The distribution and causes of untaggable SNPs. Despite the SNP density of the Phase II HapMap, there are high-frequency SNPs for which no tag can be identified. Among high-frequency SNPs (MAF ≥ 0.2), we marked as untaggable SNPs to which no other SNP within 100 kb has an r² value of at least 0.2. In Phase II, approximately 0.5–1.0% of all high-frequency SNPs are untaggable and the proportion in YRI is approximately twice as high as in the other panels. Similar proportions are observed across the ten HapMap ENCODE regions.

To identify factors influencing the location of untaggable SNPs we considered their distribution relative to segmental duplications, repeat sequence, CpG dinucleotide density, regions of low SNP density, unusual allele frequency distribution, linkage disequilibrium patterns and recombination hotspots. We find no evidence for an enrichment of untaggable SNPs in segmental duplications or repeat sequence, as would be expected from mis-mapping of SNPs (2% and 35% of common SNPs lie in segmental duplications and repeat sequence, respectively, compared to 1.8% and 29%, respectively, of untaggable SNPs). Untaggable SNPs are slightly enriched in CpG islands (0.37% of common SNPs are in CpG islands compared to 1.4% of untaggable SNPs) and have slightly reduced MAF (Fig. 4). Most notably, untaggable SNPs are strongly enriched in regions of low linkage disequilibrium, particularly in recombination hotspots.
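The working definition of an untaggable SNP used in this section (MAF ≥ 0.2, and no other SNP within 100 kb reaching r² ≥ 0.2) translates directly into a brute-force search. The sketch below assumes phased 0/1 haplotypes and uses hypothetical function names; it is quadratic in the number of SNPs and is meant only to make the definition concrete, not to reproduce the genome-wide analysis.

```python
def r_squared(hap_a, hap_b):
    """Pairwise r^2 from phased 0/1 alleles (one entry per chromosome)."""
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and equal length")
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0
    d = p_ab - p_a * p_b
    return d * d / denom


def minor_allele_frequency(hap):
    p = sum(hap) / len(hap)
    return min(p, 1.0 - p)


def untaggable_snps(positions, haplotypes,
                    min_maf=0.2, window_bp=100_000, r2_cutoff=0.2):
    """Indices of high-frequency SNPs with no proxy within the window.

    positions: SNP coordinates in bp.
    haplotypes: one 0/1 allele vector per SNP, aligned with positions.
    """
    untaggable = []
    for i, (pos_i, hap_i) in enumerate(zip(positions, haplotypes)):
        if minor_allele_frequency(hap_i) < min_maf:
            continue  # only high-frequency SNPs are assessed
        has_proxy = any(
            j != i
            and abs(positions[j] - pos_i) <= window_bp
            and r_squared(hap_i, haplotypes[j]) >= r2_cutoff
            for j in range(len(positions))
        )
        if not has_proxy:
            untaggable.append(i)
    return untaggable
```

Note that any SNP, whatever its frequency, is allowed to act as a proxy here; only the SNPs being assessed are restricted to MAF ≥ 0.2.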
To test whether these untaggable SNPs are themselves responsible for the identification of recombination hotspots, we eliminated them from 100 randomly chosen recombination hotspots and reassessed the evidence for a local peak in recombination. In all cases we still find evidence for a considerable increase in local recombination rate.

Over 50% of all untaggable SNPs lie within 1 kb of the centre of a detected recombination hotspot and over 90% are within 5 kb. Because only 3–4% of all SNPs lie within 1 kb from the centre of a detected recombination hotspot (16% are within 5 kb), this constitutes a marked enrichment and implies that at least 10% of all SNPs within 1 kb of hotspots are untaggable. The implication for association mapping is that when a region of interest contains a known hotspot it may be prudent to perform additional sequencing within the hotspot. Many of the variants identified in this manner will be untaggable SNPs that should be genotyped directly in association studies.

From a biological perspective, the proximity of untaggable SNPs to the centre of hotspots suggests that they may lie within gene conversion tracts associated with the repair of double-strand breaks. Double-strand breaks are thought to resolve as crossover events only 5–25% of the time³⁵. Consequently, SNPs lying near the centre of a hotspot are liable to be included within gene conversion tracts and will experience much higher effective recombination rates than predicted from crossover rates alone.

The distribution of recombination

In the Phase II HapMap we identified 32,996 recombination hotspots³,⁶,³⁶ (an increase of over 50% from Phase I) of which 68% localized to a region of ≤5 kb. The median map distance induced by a hotspot is 0.043 cM (or one crossover per 2,300 meioses) and the hottest identified, on chromosome 20, is 1.2 cM (one crossover per 80 meioses). Hotspots account for approximately 60% of recombination in the human genome and about 6% of sequence (Supplementary Fig. 6).
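The parenthetical conversions between map distance and meioses follow from the definition of the centimorgan: 1 cM corresponds to an expected one crossover per 100 meioses. A minimal check of the arithmetic (the function name is ours, not the paper's):

```python
def meioses_per_crossover(map_distance_cm):
    """Expected number of meioses per crossover for a given map distance in cM."""
    if map_distance_cm <= 0:
        raise ValueError("map distance must be positive")
    return 100.0 / map_distance_cm


median_hotspot = meioses_per_crossover(0.043)   # about 2,326, quoted as ~2,300
hottest_hotspot = meioses_per_crossover(1.2)    # about 83, quoted as ~80
```

Both values agree with the rounded figures given in the text.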
We do not find marked differences among chromosomes in the concentration of recombination in hotspots, which implies that obligate differences in recombination among chromosomes of different size result from differences in hotspot density and intensity [6].

The increased number of well-defined hotspots allows us to understand better the influence of genomic features on the distribution of recombination. Previous work identified specific DNA motifs that influence hotspot location [6,37] as well as additional influences of local sequence context including the location of genes [6] and base composition [38]. The Phase II HapMap provides the resolution to separate these influences. Figure 5a shows the distribution of recombination, hotspot motifs and base composition around genes. Within the transcribed region of genes there is a marked decrease in the estimated recombination rate. However, 5′ of the transcription start site is a peak in recombination rate with a corresponding local increase in the density of hotspot motifs. This region also shows a marked increase in G+C content, reflecting the presence of CpG islands in promoter regions. There is also an asymmetry in recombination rate across genes, with recombination rates 3′ of transcribed regions being elevated (as are motif density and G+C content) compared to regions 5′ of genes. Studies in yeast have previously suggested an association between promoter regions and recombination hotspots [39]. Our results suggest a significant, although weak, relationship between promoters and recombination in humans. Nevertheless, the vast majority of hotspots in the human genome are not in gene promoters. The association may reflect a general association between regions of accessible chromatin and crossover activity.

Systematic differences in recombination rate by gene class.
Previous work has demonstrated differences in the magnitude of linkage disequilibrium, as measured at a megabase scale, among genes associated with different functions [3,40]. Using the fine-scale genetic map estimated from the Phase II HapMap data we can quantify local increases in recombination rate associated with genes of different function using the Panther gene ontology annotation [41]. Average recombination rates vary more than sixfold among such gene classes (Fig. 5b), with defence and immunity genes showing the highest rates (1.9 cM Mb⁻¹) and chaperones showing the lowest rates (0.3 cM Mb⁻¹). Gene functions associated with cell surfaces and external functions tend to show higher recombination rates (immunity, cell adhesion, extracellular matrix, ion channels, signalling) whereas those with lower recombination rates are typically internal to cells (chaperones, ligase, isomerase, synthase). Controlling for systematic differences between gene classes in base composition and gene clustering, the differences between groups remain significant.

[Figure 4 image. Panels: a, SNP density (SNPs per kb); b, allele frequency (MAF); c, linkage disequilibrium (max. r²); d, hotspots (hotspots per kb); e, recombination rate (cM Mb⁻¹). Each panel is plotted against position (kb) in distance bins 0–1, 1–5, 5–10 and 10–100.]

Figure 4 | Properties of untaggable SNPs. a–e, Properties of the genomic regions surrounding untaggable SNPs in terms of: a, the density of polymorphic SNPs within the consensus data set; b, mean minor allele frequency of polymorphic SNPs; c, maximum r² of SNPs to any others in the Phase II data; d, the density of estimated recombination hotspots (defined from hotspot centres); and e, the estimated mean recombination rate. YRI, green; CEU, orange; CHB+JPT, purple.
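The Figure 5 caption states that P values for gene classes were estimated by permutation of category. A minimal sketch of such a test is given below, assuming one recombination-rate value per gene and one category label per gene. The function name, the two-sided deviation statistic and the toy data are illustrative assumptions, and the sketch does not include the paper's controls for base composition or gene clustering.

```python
import random

def category_permutation_test(rates, labels, category, n_perm=10_000, seed=0):
    """Permutation P value for one category's mean rate against the overall mean.

    rates:  per-gene recombination rates (hypothetical values in cM/Mb).
    labels: per-gene category labels, same length as rates.
    Returns (observed category mean, two-sided permutation P value).
    """
    rng = random.Random(seed)
    members = [r for r, lab in zip(rates, labels) if lab == category]
    k = len(members)
    if k == 0:
        raise ValueError("category has no genes")
    overall = sum(rates) / len(rates)
    observed = sum(members) / k
    observed_dev = abs(observed - overall)
    hits = 0
    for _ in range(n_perm):
        # Drawing k genes at random is equivalent to shuffling the category labels.
        sample = rng.sample(rates, k)
        if abs(sum(sample) / k - overall) >= observed_dev:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible P of zero.
    return observed, (hits + 1) / (n_perm + 1)

# Toy example: 5 "immunity" genes with high rates among 20 background genes.
toy_rates = [2.0, 1.8, 2.1, 1.9, 1.7] + [1.0 + 0.01 * i for i in range(20)]
toy_labels = ["immunity"] * 5 + ["other"] * 20
mean_rate, p_value = category_permutation_test(toy_rates, toy_labels, "immunity")
print(mean_rate, p_value)
```

Sampling k genes without replacement was chosen here because it is equivalent to permuting labels while keeping the category size fixed; other statistics (for example, one-sided tests) would fit the same skeleton.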
We also find that the density of hotspot-associated DNA motifs varies systematically among gene classes and that variation in motif density explains over 50% of the variance in recombination rate among gene functions (Supplementary Fig. 7).

These results pose interesting evolutionary questions. Because recombination involves DNA damage through double-strand breaks, hotspots may be selected against in some highly conserved parts of the genome. In regions exposed to recurrent selection (for example, from changes in environment or pathogen pressure) it is plausible that recombination may be selected for. However, because the fine-scale structure of recombination seems to evolve rapidly [42,43] it will be important to learn whether patterns of recombination rate heterogeneity among molecular functions are conserved between species.

Natural selection

The Phase I HapMap data have been used to identify genomic regions that show evidence for the influence of adaptive evolution [3,9], primarily through extended haplotype structure indicative of recent positive selection. Using two established approaches [9,44], we identified approximately 200 regions with evidence of recent positive selection from the Phase II HapMap (Supplementary Table 9). These regions include many established cases of selection, such as the genes HBB and LCT, the HLA region, and an inversion on chromosome 17. Many other regions have been previously identified in HapMap Phase I including LARGE, SYT1 and SULT1C2 (previously called SULT1C1). A detailed description of the findings from the Phase II HapMap is published elsewhere [45].

The Phase II HapMap also provides new insights into the forces acting on SNPs in coding regions. Effort was made to genotype as many known or putative non-synonymous SNPs as possible.
Of the 56,789 non-synonymous SNPs identified in dbSNP release 125, attempts were made to genotype 36,777, which resulted in 17,427 that are QC+ in all three analysis panels and polymorphic. We selected only those SNPs for which ancestral allele information was available (approximately 90%). For comparison, we used patterns of variation at synonymous SNPs. As previously reported [46,47], non-synonymous SNPs show an increase in frequency of rare variants and a slight decrease of common variants compared to synonymous SNPs, compatible with widespread purifying selection against non-synonymous mutations (Fig. 6a). In contrast, we find no excess of high-frequency derived non-synonymous mutations, as might be expected if positive selection were widespread.

Natural selection also influences the extent to which allele frequencies differ between populations, not only through local selective pressures that drive alleles to different frequencies [48,49], but also through local variation in the strength of purifying selection. We compared the distribution of population differentiation (as measured by F_ST, the proportion of total variation in allele frequency that is due to differences between populations) at non-synonymous SNPs and synonymous SNPs matched for allele frequency (Fig. 6b). We find a systematic bias for non-synonymous SNPs to show stronger differentiation than synonymous SNPs. Among SNPs showing high levels of differentiation there is a strong tendency for the derived allele to be at higher frequency in non-YRI populations. Among SNPs with F_ST > 0.5 between CEU and YRI, in 79% and 75% of non-synonymous and synonymous variants, respectively, the derived allele is more common in CEU. Although this difference between non-synonymous and synonymous SNPs is not significant, among the eight exonic SNPs with F_ST > 0.95, all are non-synonymous. We see no such bias towards increased MAF in CEU at high-differentiation SNPs, indicating that SNP ascertainment is unlikely to explain the difference.
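To make the F_ST thresholds quoted above concrete, the following Python sketch computes a simple heterozygosity-based F_ST for one biallelic SNP from the allele frequencies in two populations. This is an illustrative simplification with equal population weights and no sample-size correction; it is not claimed to be the estimator used in the paper, and the function name and example frequencies are hypothetical.

```python
def fst_two_populations(p1, p2):
    """Heterozygosity-based F_ST for one biallelic SNP in two equally weighted populations.

    p1, p2: frequency of the same allele (e.g. the derived allele) in each population.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                     # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0                                        # SNP is fixed for the same allele in both
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population heterozygosity
    return (h_total - h_within) / h_total

print(round(fst_two_populations(0.5, 0.5), 3))  # 0.0
print(round(fst_two_populations(0.1, 0.9), 3))  # 0.64
print(round(fst_two_populations(0.0, 1.0), 3))  # 1.0
```

Under this simplified definition, a derived allele at 10% in one panel and 90% in the other already exceeds the F_ST > 0.5 threshold, whereas values above 0.95 require the two panels to be close to fixed for different alleles.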
Rather, this effect can largely be explained by more genetic drift in the non-African populations, as confirmed by simulations (data not shown). In addition, reduced selection against deleterious mutations and local adaptation within non-African populations will both act to increase the frequency of derived variants in non-African populations.

To assess the evidence for widespread local adaptation influencing non-synonymous mutations we considered the distribution of integrated extended haplotype homozygosity (iEHH) statistics [9,44] (Fig. 6c). We find no evidence for systematic differences between non-synonymous and synonymous SNPs, suggesting that local adaptation does not explain their higher differentiation. Although hitch-hiking effects will tend to obscure differences between selected

[Figure 5 image. a, Recombination rate (cM Mb⁻¹), motif density (motifs per kb) and G+C content plotted against position (kb). b, Mean recombination rate within category (cM Mb⁻¹) and significance level (P) for each gene category (number of genes): Defence/immunity protein (269), Cell adhesion molecule (268), Extracellular matrix (262), Ion channel (264), Signalling molecule (627), Protease (394), Receptor (1,158), Transporter (474), Select calcium-binding protein (190), Cell junction protein (77), Hydrolase (508), Cytoskeletal protein (547), Miscellaneous function (591), Transfer/carrier protein (230), Transcription factor (1,322), Oxidoreductase (461), Select regulatory molecule (821), Transferase (614), Membrane traffic protein (249), Lyase (112), Phosphatase (197), Kinase (513), Nucleic acid binding (1,567), Synthase and synthetase (170), Isomerase (107), Ligase (305), Chaperone (128).]

Figure 5 | Recombination rates around genes. a, The recombination rate, density of recombination-hotspot-associated motifs (all motifs with up to 1 bp different from the consensus CCTCCCTNNCCAC) and G+C content around genes.
The blue line indicates the mean. For the recombination rate, grey lines indicate the quartiles of the distribution. Values were calculated separately 5′ from the transcription start site (the first dotted line) and 3′ from the transcription end site (third dotted line) and were joined at the median midpoint position of the transcription unit (central dotted line). Note the sharp drop in recombination rate within the transcription unit, the local increase around the transcription start site and the broad decrease away from the 3′ end of genes. These patterns only partly reflect the distribution of G+C content and the hotspot-associated motif, suggesting that additional factors influence recombination rates around genes. b, Recombination rates within genes of different molecular function [41]. The chart shows the increase or decrease for each category compared to the genome average. P values were estimated by permutation of category; numbers of genes are shown in parentheses.

and neutral SNPs, these results are consistent with a scenario in which the higher differentiation of non-synonymous SNPs is primarily driven by a reduction in the strength or efficacy of purifying selection in non-African populations.

Discussion and prospects

The International HapMap Project has been instrumental in making well-powered, large-scale, genome-wide association studies a reality. It is now clear that the HapMap can be a useful resource for the design and analysis of disease association studies in populations across the world [50–53]. Furthermore, the decreasing costs and increasing SNP density of standard genotyping panels mean that the focus of attention in disease association studies is shifting from candidate gene approaches towards genome-wide analyses.
Alongside developments in technology, new statistical methodologies aimed at improving aspects of analysis, such as genotype calling [21,54], the identification of and correction for population stratification and relatedness [55,56], and imputation of untyped variants [21–23], are increasing the accuracy and reliability of genome-wide association studies.

Within this context, it is important to consider the future of the HapMap Project. Currently, additional samples from the populations used to develop the initial HapMap, as well as samples from seven additional populations (Luhya in Webuye, Kenya; Maasai in Kinyawa, Kenya; Tuscans in Italy; Gujarati Indian in Houston, Texas, USA; Denver (Colorado) metropolitan Chinese community; people of Mexican origin in Los Angeles, California, USA; and people with African ancestry in the southwestern United States; http://ccr.coriell.org/Sections/Collections/NHGRI/?SsId=11) will be sequenced and genotyped extensively to extend the HapMap, providing information on rarer variants and helping to enable genome-wide association studies in additional populations. There are also ongoing efforts by many groups to characterize additional forms of genetic variation, such as structural variation, and molecular phenotypes in the HapMap samples. Finally, in the future, whole-genome sequencing will provide a natural convergence of technologies to type both SNP and structural variation. Nevertheless, until that point, and even after, the HapMap Project data will provide an invaluable resource for understanding the structure of human genetic variation and its link to phenotype.

METHODS SUMMARY

Of approximately 6.9 million SNPs in dbSNP release 122 approximately 4.7 million were selected for genotyping by Perlegen. 2.5 million SNPs were excluded because no assay could be designed and a further 350,000 were excluded for other reasons (see Methods). Perlegen performed genotyping using custom high-density oligonucleotide arrays as previously described [15].
Additional genotype submissions are described in the text. QC filters were applied as previously described [3]. Where multiple submissions met the QC criteria the submission with the lowest missing data rate was chosen for inclusion in the non-redundant filtered data set. Haplotypes were estimated from genotype data as described previously [3]. Ancestral states at SNPs were inferred by parsimony by comparison to orthologous bases in the chimpanzee (panTro2) and rhesus macaque (rheMac2) assemblies. Recombination rates and the location of recombination hotspots were estimated as described previously [3]. Additional details can be found in the Methods section and the Supplementary Information. The data described in this paper are in release 21 of the International HapMap Project.

Full Methods and any associated references are available in the online version of the paper at www.nature.com/nature.

Received 12 April; accepted 18 September 2007.

1. The International HapMap Consortium. Integrating ethics and science in the International HapMap Project. Nature Rev. Genet. 5, 467–475 (2004).
2. The International HapMap Consortium. The International HapMap Project. Nature 426, 789–796 (2003).
3. The International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299–1320 (2005).
4. Bowcock, A. M. Genomics: guilt by association. Nature 447, 645–646 (2007).
5. Altshuler, D. & Daly, M. Guilt beyond a reasonable doubt. Nature Genet. 39, 813–815 (2007).
6. Myers, S., Bottolo, L., Freeman, C., McVean, G. & Donnelly, P. A fine-scale map of recombination rates and hotspots across the human genome. Science 310, 321–324 (2005).
7. McCarroll, S. A. et al. Common deletion polymorphisms in the human genome. Nature Genet. 38, 86–92 (2006).
8. Conrad, D. F., Andrews, T. D., Carter, N. P., Hurles, M. E. & Pritchard, J. K. A high-resolution survey of deletion polymorphism in the human genome. Nature Genet. 38, 75–81 (2006).
9. Voight, B. F., Kudaravalli, S., Wen, X. & Pritchard, J. K.
A map of recent positive selection in the human genome. PLoS Biol. 4, e72 (2006).
10. Redon, R. et al. Global variation in copy number in the human genome. Nature 444, 444–454 (2006).
11. de Bakker, P. I. et al. A high-resolution HLA and SNP haplotype map for disease association studies in the extended human MHC. Nature Genet. 38, 1166–1172 (2006).
12. Pastinen, T. et al. Mapping common regulatory variants to human haplotypes. Hum. Mol. Genet. 14, 3963–3971 (2005).
13. Stranger, B. E. et al. Genome-wide associations of gene expression variation in humans. PLoS Genet. 1, e78 (2005).
14. Cheung, V. G. et al. Mapping determinants of human gene expression by regional and genome-wide association. Nature 437, 1365–1369 (2005).
15. Hinds, D. A. et al. Whole-genome patterns of common DNA variation in three human populations. Science 307, 1072–1079 (2005).
16. de Bakker, P. I. et al. Efficiency and power in genetic association studies. Nature Genet. 37, 1217–1223 (2005).
17. Pe'er, I. et al. Evaluating and improving power in whole-genome association studies using fixed marker sets. Nature Genet. 38, 663–667 (2006).
18. Barrett, J. C. & Cardon, L. R. Evaluating coverage of genome-wide association studies. Nature Genet. 38, 659–662 (2006).
19. Burdick, J. T., Chen, W. M., Abecasis, G. R. & Cheung, V. G. In silico method for inferring genotypes in pedigrees. Nature Genet. 38, 1002–1004 (2006).
20. Servin, B. R. & Stephens, M. Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet. 3, e114 (2007).

[Figure 6 image. a, Proportion of SNPs against derived allele frequency (DAF) for the CEU, CHB+JPT and YRI panels. b, Proportion of SNPs with F_ST > 0.5 against average DAF across panels; asterisks mark significant classes. c, Proportion of top 5% iEHH non-synonymous SNPs against DAF.]

Figure 6 | Properties of non-synonymous and synonymous SNPs.
a, The derived allele frequency (DAF) spectrum in each analysis panel for all SNPs (black), synonymous SNPs (green) and non-synonymous SNPs (red). Note the excess of rare variants for coding sequence SNPs but no excess of high-frequency derived variants. b, Enrichment of non-synonymous SNPs among genic SNPs showing high differentiation. For each of ten classes of derived allele frequency (averaged across analysis panels) the fraction of non-synonymous (red) and synonymous (green) variants in that class that show F_ST > 0.5 is shown. Note the strong enrichment of non-synonymous SNPs among SNPs of moderate to high derived-allele frequency (asterisk, P < 0.05; double asterisk, P < 0.01). c, Lack of enrichment of non-synonymous SNPs among those showing long-range haplotype structure. The integrated extended haplotype homozygosity (iEHH) statistic [9] was calculated for non-synonymous and synonymous SNPs in each analysis panel (YRI, green; CEU, orange; CHB+JPT, purple). For each of ten derived allele frequency classes, the proportion of non-synonymous SNPs among those showing the 5% most extreme statistics (within the allele frequency class) is shown (points). Also shown is the proportion of non-synonymous SNPs among SNPs in the coding sequence for each frequency class (dotted lines). Differences between synonymous and non-synonymous SNPs are tested for using a contingency table test.

21. The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–668 (2007).
22. Scott, L. J. et al. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science 316, 1341–1345 (2007).
23. Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genet. 39, 906–913 (2007).
24. Chapman, J. M., Cooper, J.
D., Todd, J. A. & Clayton, D. G. Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power. Hum. Hered. 56, 18–31 (2003).
25. Paabo, S. The mosaic that is our genome. Nature 421, 409–412 (2003).
26. McVean, G., Spencer, C. C. & Chaix, R. Perspectives on human genetic variation from the HapMap Project. PLoS Genet. 1, e54 (2005).
27. Purcell, S. et al. PLINK: a toolset for whole-genome association and population-based linkage analysis. Am. J. Hum. Genet. 81, 559–575 (2007).
28. Broman, K. W. & Weber, J. L. Long homozygous chromosomal segments in reference families from the centre d'Etude du polymorphisme humain. Am. J. Hum. Genet. 65, 1493–1500 (1999).
29. Gibson, J., Morton, N. E. & Collins, A. Extended tracts of homozygosity in outbred human populations. Hum. Mol. Genet. 15, 789–795 (2006).
30. Lander, E. S. & Botstein, D. Homozygosity mapping: a way to map human recessive traits with the DNA of inbred children. Science 236, 1567–1570 (1987).
31. Leutenegger, A. L. et al. Using genomic inbreeding coefficient estimates for homozygosity mapping of rare recessive traits: application to Taybi-Linder syndrome. Am. J. Hum. Genet. 79, 62–66 (2006).
32. Te Meerman, G. J., Van der Meulen, M. A. & Sandkuijl, L. A. Perspectives of identity by descent (IBD) mapping in founder populations. Clin. Exp. Allergy 25 (Suppl 2), 97–102 (1995).
33. Houwen, R. H. et al. Genome screening by searching for shared segments: mapping a gene for benign recurrent intrahepatic cholestasis. Nature Genet. 8, 380–386 (1994).
34. Durham, L. K. & Feingold, E. Genome scanning for segments shared identical by descent among distant relatives in isolated populations. Am. J. Hum. Genet. 61, 830–842 (1997).
35. Jeffreys, A. J. & May, C. A. Intense and highly localized gene conversion activity in human meiotic crossover hot spots. Nature Genet. 36, 151–156 (2004).
36.
McVean, G. A. et al. The fine-scale structure of recombination rate variation in the human genome. Science 304, 581–584 (2004).
37. Myers, S. et al. The distribution and causes of meiotic recombination in the human genome. Biochem. Soc. Trans. 34, 526–530 (2006).
38. Spencer, C. C. et al. The influence of recombination on human genetic diversity. PLoS Genet. 2, e148 (2006).
39. Petes, T. D. Meiotic recombination hot spots and cold spots. Nature Rev. Genet. 2, 360–369 (2001).
40. Smith, A. V., Thomas, D. J., Munro, H. M. & Abecasis, G. R. Sequence features in regions of weak and strong linkage disequilibrium. Genome Res. 15, 1519–1534 (2005).
41. Thomas, P. D. et al. PANTHER: a library of protein families and subfamilies indexed by function. Genome Res. 13, 2129–2141 (2003).
42. Winckler, W. et al. Comparison of fine-scale recombination rates in humans and chimpanzees. Science 308, 107–111 (2005).
43. Ptak, S. E. et al. Fine-scale recombination patterns differ between chimpanzees and humans. Nature Genet. 37, 429–434 (2005).
44. Sabeti, P. C. et al. Detecting recent positive selection in the human genome from haplotype structure. Nature 419, 832–837 (2002).
45. Sabeti, P. C. et al. Genome-wide detection and characterization of positive selection in human populations. Nature doi:10.1038/nature06250 (this issue).
46. Bustamante, C. D. et al. Natural selection on protein-coding genes in the human genome. Nature 437, 1153–1157 (2005).
47. Cargill, M. et al. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genet. 22, 231–238 (1999).
48. Akey, J. M., Zhang, G., Zhang, K., Jin, L. & Shriver, M. D. Interrogating a high-density SNP map for signatures of natural selection. Genome Res. 12, 1805–1814 (2002).
49. Sabeti, P. C. et al. Positive natural selection in the human lineage. Science 312, 1614–1620 (2006).
50. de Bakker, P. I. et al. Transferability of tag SNPs in genetic association studies in multiple populations. Nature Genet. 38, 1298–1303 (2006).
51. Conrad, D. F. et al. A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nature Genet. 38, 1251–1260 (2006).
52. Service, S., Sabatti, C. & Freimer, N. Tag SNPs chosen from HapMap perform well in several population isolates. Genet. Epidemiol. 31, 189–194 (2007).
53. Lim, J. et al. Comparative study of the linkage disequilibrium of an ENCODE region, chromosome 7p15, in Korean, Japanese, and Han Chinese samples. Genomics 87, 392–398 (2006).
54. Rabbee, N. & Speed, T. P. A genotype calling algorithm for affymetrix SNP arrays. Bioinformatics 22, 7–12 (2006).
55. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
56. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genet. 38, 904–909 (2006).
57. Smith, R. A., Ho, P. J., Clegg, J. B., Kidd, J. R. & Thein, S. L. Recombination breakpoints in the human β-globin gene cluster. Blood 92, 4415–4421 (1998).
58. Holloway, K., Lawson, V. E. & Jeffreys, A. J. Allelic recombination and de novo deletions in sperm in the human β-globin gene region. Hum. Mol. Genet. 15, 1099–1111 (2006).
59. Weir, B. S. & Cockerham, C. C. Estimating F-statistics for the analysis of population structure. Evolution 38, 1358–1370 (1984).

Supplementary Information is linked to the online version of the paper at www.nature.com/nature.

Acknowledgements We thank many people who contributed to this project: all members of the genotyping laboratory and the sample, primer, bioinformatics, data quality and IT groups at Perlegen Sciences for technical and infrastructural support; J. Beck, C. Beiswanger, D. Coppock, A. Leach, J. Mintzer and L. Toji for transforming the Yoruba, Japanese and Han Chinese samples, distributing the DNA and cell lines, storing the samples for use in future research, and producing the community newsletters and reports; J. Greenberg and R.
Anderson for providing funding and support for cell line transformation and storage in the NIGMS Human Genetic Cell Repository at the Coriell Institute; T. Dibling, T. Ishikura, S. Kanazawa, S. Mizusawa and S. Saito for help with genotyping; C. Hind and A. Moghadam for technical support in genotyping and all members of the subcloning and sequencing teams at the Wellcome Trust Sanger Institute; X. Ke for help with data analysis; Oxford E-Science Centre for provision of high-performance computing resources; H. Chen, W. Chen, L. Deng, Y. Dong, C. Fu, L. Gao, H. Geng, J. Geng, M. He, H. Li, H. Li, S. Li, X. Li, B. Liu, Z. Liu, F. Lu, F. Lu, G. Lu, C. Luo, X. Wang, Z. Wang, C. Ye and X. Yu for help with genotyping and sample collection; X. Feng, Y. Li, J. Ren and X. Zhou for help with sample collection; J. Fan, W. Gu, W. Guan, S. Hu, H. Jiang, R. Lei, Y. Lin, Z. Niu, B. Wang, L. Yang, W. Yang, Y. Wang, Z. Wang, S. Xu, W. Yan, H. Yang, W. Yuan, C. Zhang, J. Zhang, K. Zhang and G. Zhao for help with genotyping; P. Fong, C. Lai, C. Lau, T. Leung, L. Luk and W. Tong for help with genotyping; C. Pang for help with genotyping; K. Ding, B. Qiang, J. Zhang, X. Zhang and K. Zhou for help with genotyping; Q. Fu, S. Ghose, X. Lu, D. Nelson, A. Perez, S. Poole, R. Vega and H. Yonath for help with genotyping; C. Bruckner, T. Brundage, S. Chow, O. Iartchouk, M. Jain, M. Moorhead and K. Tran for help with genotyping; N. Addleman, J. Atilano, T. Chan, C. Chu, C. Ha, T. Nguyen, M. Minton and A. Phong for help with genotyping, and D. Lind for help with quality control and experimental design; R. Donaldson and S. Duan for help with genotyping, and J. Rice and N. Saccone for help with experimental design; J. Wigginton for help with implementing and testing QA/QC software; A. Clark, B. Keats, R. Myers, D. Nickerson and A. Williamson for providing advice to NIH; C. Juenger, C. Bennet, C. Bird, J. Melone, P. Nailer, M. Weiss, J. Witonsky and E.
DeHaut-Combs for help with project management; M. Gray for organizing phone calls and meetings; D. Leja for help with figures; the Yoruba people of Ibadan, Nigeria, the people of Tokyo, Japan, and the community at Beijing Normal University, who participated in public consultations and community engagements; the people in these communities who donated their blood samples; and the people in the Utah CEPH community who allowed the samples they donated earlier to be used for the Project. This work was supported by the Japanese Ministry of Education, Culture, Sports, Science and Technology, the Wellcome Trust, Nuffield Trust, Wolfson Foundation, UK EPSRC, Genome Canada, Génome Québec, the Chinese Academy of Sciences, the Ministry of Science and Technology of the People's Republic of China, the National Natural Science Foundation of China, the Hong Kong Innovation and Technology Commission, the University Grants Committee of Hong Kong, the SNP Consortium, the US National Institutes of Health (FIC, NCI, NCRR, NEI, NHGRI, NIA, NIAAA, NIAID, NIAMS, NIBIB, NIDA, NIDCD, NIDCR, NIDDK, NIEHS, NIGMS, NIMH, NINDS, NLM, OD), the W.M. Keck Foundation, and the Delores Dore Eccles Foundation. All SNPs genotyped within the HapMap Project are available from dbSNP (http://www.ncbi.nlm.nih.gov/SNP); all genotype information is available from dbSNP and the HapMap website (http://www.hapmap.org).

Author Information Reprints and permissions information is available at www.nature.com/reprints. The authors declare competing financial interests: details accompany the full-text HTML version of the paper at www.nature.com/nature. Correspondence and requests for materials should be addressed to G.M. (mcvean@stats.ox.ac.uk) or M.D. (mjdaly@chgr.mgh.harvard.edu).

The International HapMap Consortium (Participants are arranged by institution and then alphabetically within institutions except for Principal Investigators and Project Leaders, as indicated.)

Genotyping centres: Perlegen Sciences Kelly A.
Frazer (Principal Investigator)¹, Dennis G. Ballinger², David R. Cox², David A. Hinds², Laura L. Stuve²; Baylor College of Medicine and ParAllele BioScience Richard A. Gibbs (Principal Investigator)³, John W. Belmont³, Andrew Boudreau⁴, Paul Hardenbol⁵, Suzanne M. Leal³, Shiran Pasternak⁶, David A. Wheeler³, Thomas D. Willis⁴, Fuli Yu⁷; Beijing Genomics Institute Huanming Yang (Principal Investigator)⁸, Changqing Zeng (Principal Investigator)⁸, Yang Gao⁸, Haoran Hu⁸, Weitao Hu⁸, Chaohua Li⁸, Wei Lin⁸, Siqi Liu⁸, Hao Pan⁸, Xiaoli Tang⁸, Jian Wang⁸, Wei Wang⁸, Jun Yu⁸, Bo Zhang⁸, Qingrun Zhang⁸, Hongbin Zhao⁸, Hui Zhao⁸, Jun Zhou⁸; Broad Institute of Harvard and Massachusetts Institute of Technology Stacey B. Gabriel (Project Leader)⁷, Rachel Barry⁷, Brendan Blumenstiel⁷, Amy Camargo⁷, Matthew Defelice⁷, Maura Faggart⁷, Mary Goyette⁷, Supriya Gupta⁷, Jamie Moore⁷, Huy Nguyen⁷, Robert C. Onofrio⁷, Melissa Parkin⁷, Jessica Roy⁷, Erich Stahl⁷, Ellen Winchester⁷, Liuda Ziaugra⁷, David Altshuler (Principal Investigator)⁷,⁹; Chinese National Human Genome Center at Beijing Yan Shen (Principal Investigator)¹⁰, Zhijian Yao¹⁰; Chinese National Human Genome Center at Shanghai Wei Huang (Principal Investigator)¹¹, Xun Chu¹¹, Yungang He¹¹, Li Jin¹², Yangfan Liu¹¹, Yayun Shen¹¹, Weiwei Sun¹¹, Haifeng Wang¹¹, Yi Wang¹¹, Ying Wang¹¹, Xiaoyan Xiong¹¹, Liang Xu¹¹; Chinese University of Hong Kong Mary M. Y. Waye (Principal Investigator)¹³, Stephen K. W. Tsui¹³; Hong Kong University of Science and Technology Hong Xue (Principal Investigator)¹⁴, J. Tze-Fei Wong¹⁴; Illumina Luana M. Galver (Project Leader)¹⁵, Jian-Bing Fan¹⁵, Kevin Gunderson¹⁵, Sarah S. Murray¹, Arnold R. Oliphant¹⁶, Mark S.
Chee (Principal Investigator)¹⁷; McGill University and Génome Québec Innovation Centre Alexandre Montpetit (Project Leader)¹⁸, Fanny Chagnon¹⁸, Vincent Ferretti¹⁸, Martin Leboeuf¹⁸, Jean-François Olivier⁴, Michael S. Phillips¹⁸, Stéphanie Roumy¹⁵, Clémentine Sallée¹⁹, Andrei Verner¹⁸, Thomas J. Hudson (Principal Investigator)²⁰; University of California at San Francisco and Washington University Pui-Yan Kwok (Principal Investigator)²¹, Dongmei Cai²¹, Daniel C. Koboldt²², Raymond D. Miller²², Ludmila Pawlikowska²¹, Patricia Taillon-Miller²², Ming Xiao²¹; University of Hong Kong Lap-Chee Tsui (Principal Investigator)²³, William Mak²³, You Qiang Song²³, Paul K. H. Tam²³; University of Tokyo and RIKEN Yusuke Nakamura (Principal Investigator)²⁴,²⁵, Takahisa Kawaguchi²⁵, Takuya Kitamoto²⁵, Takashi Morizono²⁵, Atsushi Nagashima²⁵, Yozo Ohnishi²⁵, Akihiro Sekine²⁵, Toshihiro Tanaka²⁵, Tatsuhiko Tsunoda²⁵; Wellcome Trust Sanger Institute Panos Deloukas (Project Leader)²⁶, Christine P. Bird²⁶, Marcos Delgado²⁶, Emmanouil T. Dermitzakis²⁶, Rhian Gwilliam²⁶, Sarah Hunt²⁶, Jonathan Morrison²⁷, Don Powell²⁶, Barbara E. Stranger²⁶, Pamela Whittaker²⁶, David R. Bentley (Principal Investigator)²⁸

Analysis groups: Broad Institute Mark J. Daly (Project Leader)⁷,⁹, Paul I. W. de Bakker⁷,⁹, Jeff Barrett⁷,⁹, Yves R. Chretien⁷, Julian Maller⁷,⁹, Steve McCarroll⁷,⁹, Nick Patterson⁷, Itsik Pe'er²⁹, Alkes Price⁷, Shaun Purcell⁹, Daniel J. Richter⁷, Pardis Sabeti⁷, Richa Saxena⁷,⁹, Stephen F. Schaffner⁷, Pak C. Sham²³, Patrick Varilly⁷, David Altshuler (Principal Investigator)⁷,⁹; Cold Spring Harbor Laboratory Lincoln D. Stein (Principal Investigator)⁶, Lalitha Krishnan⁶, Albert Vernon Smith⁶, Marcela K. Tello-Ruiz⁶, Gudmundur A. Thorisson³⁰; Johns Hopkins University School of Medicine Aravinda Chakravarti (Principal Investigator)³¹, Peter E. Chen³¹, David J. Cutler³¹, Carl S.
Kashuk 31, Shin Lin 31; University of Michigan Gonçalo R. Abecasis (Principal Investigator) 32, Weihua Guan 32, Yun Li 32, Heather M. Munro 33, Zhaohui Steve Qin 32, Daryl J. Thomas 34; University of Oxford Gilean McVean (Project Leader) 35, Adam Auton 35, Leonardo Bottolo 35, Niall Cardin 35, Susana Eyheramendy 35, Colin Freeman 35, Jonathan Marchini 35, Simon Myers 35, Chris Spencer 7, Matthew Stephens 36, Peter Donnelly (Principal Investigator) 35; University of Oxford, Wellcome Trust Centre for Human Genetics Lon R. Cardon (Principal Investigator) 37, Geraldine Clarke 38, David M. Evans 38, Andrew P. Morris 38, Bruce S. Weir 39; RIKEN Tatsuhiko Tsunoda (Principal Investigator) 25, Todd A. Johnson 25; US National Institutes of Health James C. Mullikin 40; US National Institutes of Health National Center for Biotechnology Information Stephen T. Sherry 41, Michael Feolo 41, Andrew Skol 42

Community engagement/public consultation and sample collection groups: Beijing Normal University and Beijing Genomics Institute Houcan Zhang 43, Changqing Zeng 8, Hui Zhao 8; Health Sciences University of Hokkaido, Eubios Ethics Institute, and Shinshu University Ichiro Matsuda (Principal Investigator) 44, Yoshimitsu Fukushima 45, Darryl R. Macer 46, Eiko Suda 47; Howard University and University of Ibadan Charles N. Rotimi (Principal Investigator) 48, Clement A. Adebamowo 49, Ike Ajayi 49, Toyin Aniagwu 49, Patricia A. Marshall 50, Chibuzor Nkwodimmah 49, Charmaine D. M. Royal 48; University of Utah Mark F. Leppert (Principal Investigator) 51, Missy Dixon 51, Andy Peiffer 51

Ethical, legal and social issues: Chinese Academy of Social Sciences Renzong Qiu 52; Genetic Interest Group Alastair Kent 53; Kyoto University Kazuto Kato 54; Nagasaki University Norio Niikawa 55; University of Ibadan School of Medicine Isaac F. Adewole 49; University of Montréal Bartha M. Knoppers 19; University of Oklahoma Morris W.
Foster 56; Vanderbilt University Ellen Wright Clayton 57; Wellcome Trust Jessica Watkin 58

SNP discovery: Baylor College of Medicine Richard A. Gibbs (Principal Investigator) 3, John W. Belmont 3, Donna Muzny 3, Lynne Nazareth 3, Erica Sodergren 3, George M. Weinstock 3, David A. Wheeler 3, Imtaz Yakub 3; Broad Institute of Harvard and Massachusetts Institute of Technology Stacey B. Gabriel (Project Leader) 7, Robert C. Onofrio 7, Daniel J. Richter 7, Liuda Ziaugra 7, Bruce W. Birren 7, Mark J. Daly 7,9, David Altshuler (Principal Investigator) 7,9; Washington University Richard K. Wilson (Principal Investigator) 59, Lucinda L. Fulton 59; Wellcome Trust Sanger Institute Jane Rogers (Principal Investigator) 26, John Burton 26, Nigel P. Carter 26, Christopher M. Clee 26, Mark Griffiths 26, Matthew C. Jones 26, Kirsten McLay 26, Robert W. Plumb 26, Mark T. Ross 26, Sarah K. Sims 26, David L. Willey 26

Scientific management: Chinese Academy of Sciences Zhu Chen 60, Hua Han 60, Le Kang 60; Genome Canada Martin Godbout 61, John C. Wallenburg 62; Génome Québec Paul L'Archevêque 63, Guy Bellemare 63; Japanese Ministry of Education, Culture, Sports, Science and Technology Koji Saeki 64; Ministry of Science and Technology of the People's Republic of China Hongguang Wang 65, Daochang An 65, Hongbo Fu 65, Qing Li 65, Zhen Wang 65; The Human Genetic Resource Administration of China Renwu Wang 66; The SNP Consortium Arthur L. Holden 15; US National Institutes of Health Lisa D. Brooks 67, Jean E. McEwen 67, Mark S. Guyer 67, Vivian Ota Wang 67,68, Jane L. Peterson 67, Michael Shi 69, Jack Spiegel 70, Lawrence M. Sung 71, Lynn F. Zacharia 67, Francis S. Collins 72; Wellcome Trust Karen Kennedy 61, Ruth Jamieson 58, John Stewart 58

1 The Scripps Research Institute, 10550 North Torrey Pines Road MEM275, La Jolla, California 92037, USA. 2 Perlegen Sciences, Inc., 2021 Stierlin Court, Mountain View, California 94043, USA.
3 Baylor College of Medicine, Human Genome Sequencing Center, Department of Molecular and Human Genetics, 1 Baylor Plaza, Houston, Texas 77030, USA. 4 Affymetrix, Inc., 3420 Central Expressway, Santa Clara, California 95051, USA. 5 Pacific Biosciences, 1505 Adams Drive, Menlo Park, California 94025, USA. 6 Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, New York 11724, USA. 7 The Broad Institute of Harvard and Massachusetts Institute of Technology, 1 Kendall Square, Cambridge, Massachusetts 02139, USA. 8 Beijing Genomics Institute, Chinese Academy of Sciences, Beijing 100300, China. 9 Massachusetts General Hospital and Harvard Medical School, Simches Research Center, 185 Cambridge Street, Boston, Massachusetts 02114, USA. 10 Chinese National Human Genome Center at Beijing, 3-707 N. Yongchang Road, Beijing Economic-Technological Development Area, Beijing 100176, China. 11 Chinese National Human Genome Center at Shanghai, 250 Bi Bo Road, Shanghai 201203, China. 12 Fudan University and CAS-MPG Partner Institute for Computational Biology, School of Life Sciences, SIBS, CAS, Shanghai 201203, China. 13 The Chinese University of Hong Kong, Department of Biochemistry, The Croucher Laboratory for Human Genetics, 6/F Mong Man Wai Building, Shatin, Hong Kong. 14 Hong Kong University of Science and Technology, Department of Biochemistry and Applied Genomics Center, Clear Water Bay, Kowloon, Hong Kong. 15 Illumina, 9885 Towne Centre Drive, San Diego, California 92121, USA. 16 Complete Genomics, Inc., 658 North Pastoria Avenue, Sunnyvale, California 94085, USA. 17 Prognosys Biosciences, Inc., 4215 Sorrento Valley Boulevard, Suite 105, San Diego, California 92121, USA. 18 McGill University and Génome Québec Innovation Centre, 740 Dr. Penfield Avenue, Montréal, Québec H3A 1A4, Canada. 19 University of Montréal, The Public Law Research Centre (CRDP), PO Box 6128, Downtown Station, Montréal, Québec H3C 3J7, Canada.
20 Ontario Institute for Cancer Research, MaRS Centre, South Tower, 101 College Street, Suite 500, Toronto, Ontario M5G 1L7, Canada. 21 University of California, San Francisco, Cardiovascular Research Institute, 513 Parnassus Avenue, Box 0793, San Francisco, California 94143, USA. 22 Washington University School of Medicine, Department of Genetics, 660 South Euclid Avenue, Box 8232, St Louis, Missouri 63110, USA. 23 University of Hong Kong, Genome Research Centre, 6/F, Laboratory Block, 21 Sassoon Road, Pokfulam, Hong Kong. 24 University of Tokyo, Institute of Medical Science, 4-6-1 Sirokanedai, Minato-ku, Tokyo 108-8639, Japan. 25 RIKEN SNP Research Center, 1-7-22 Suehiro-cho, Tsurumi-ku Yokohama, Kanagawa 230-0045, Japan. 26 Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. 27 University of Cambridge, Department of Oncology, Cambridge CB1 8RN, UK. 28 Solexa Ltd, Chesterford Research Park, Little Chesterford, Nr Saffron Walden, Essex CB10 1XL, UK. 29 Columbia University, 500 West 120th Street, New York, New York 10027, USA. 30 University of Leicester, Department of Genetics, Leicester LE1 7RH, UK. 31 Johns Hopkins University School of Medicine, McKusick-Nathans Institute of Genetic Medicine, Broadway Research Building, Suite 579, 733 North Broadway, Baltimore, Maryland 21205, USA. 32 University of Michigan, Center for Statistical Genetics, Department of Biostatistics, 1420 Washington Heights, Ann Arbor, Michigan 48109, USA. 33 International Epidemiology Institute, 1455 Research Boulevard, Suite 550, Rockville, Maryland 20850, USA. 34 Center for Biomolecular Science and Engineering, Engineering 2, Suite 501, Mail Stop CBSE/ITI, UC Santa Cruz, Santa Cruz, California 95064, USA. 35 University of Oxford, Department of Statistics, 1 South Parks Road, Oxford OX1 3TG, UK. 36 University of Chicago, Department of Statistics, 5734 South University Avenue, Eckhart Hall, Room 126, Chicago, Illinois 60637, USA.
37 Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, Washington 98109, USA. 38 University of Oxford/Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford OX3 7BN, UK. 39 University of Washington Department of Biostatistics, Box 357232, Seattle, Washington 98195, USA. 40 US National Institutes of Health, National Human Genome Research Institute, 50 South Drive, Bethesda, Maryland 20892, USA. 41 US National Institutes of Health, National Library of Medicine, National Center for Biotechnology Information, 8600 Rockville Pike, Bethesda, Maryland 20894, USA. 42 University of Chicago, Department of Medicine, Section of Genetic Medicine, 5801 South Ellis, Chicago, Illinois 60637, USA. 43 Beijing Normal University, 19 Xinjiekouwai Street, Beijing 100875, China. 44 Health Sciences University of Hokkaido, Ishikari Tobetsu Machi 1757, Hokkaido 061-0293, Japan. 45 Shinshu University School of Medicine, Department of Medical Genetics, Matsumoto 390-8621, Japan. 46 United Nations Educational, Scientific and Cultural Organization (UNESCO Bangkok), 920 Sukhumwit Road, Prakanong, Bangkok 10110, Thailand. 47 University of Tsukuba, Eubios Ethics Institute, PO Box 125, Tsukuba Science City 305-8691, Japan. 48 Howard University, National Human Genome Center, 2216 6th Street, NW, Washington DC 20059, USA. 49 University of Ibadan College of Medicine, Ibadan, Oyo State, Nigeria. 50 Case Western Reserve University School of Medicine, Department of Bioethics, 10900 Euclid Avenue, Cleveland, Ohio 44106, USA. 51 University of Utah, Eccles Institute of Human Genetics, Department of Human Genetics, 15 North 2030 East, Salt Lake City, Utah 84112, USA. 52 Chinese Academy of Social Sciences, Institute of Philosophy/Center for Applied Ethics, 2121, Building 9, Caoqiao Xinyuan 3 Qu, Beijing 100067, China. 53 Genetic Interest Group, 4D Leroy House, 436 Essex Road, London N130P, UK.
54 Kyoto University, Institute for Research in Humanities and Graduate School of Biostudies, Ushinomiya-cho, Sakyo-ku, Kyoto 606-8501, Japan. 55 Nagasaki University Graduate School of Biomedical Sciences, Department of Human Genetics, Sakamoto 1-12-4, Nagasaki 852-8523, Japan. 56 University of Oklahoma, Department of Anthropology, 455 West Lindsey Street, Norman, Oklahoma 73019, USA. 57 Vanderbilt University, Center for Genetics and Health Policy, 507 Light Hall, Nashville, Tennessee 37232, USA. 58 Wellcome Trust, 215 Euston Road, London NW1 2BE, UK. 59 Washington University School of Medicine, Genome Sequencing Center, Box 8501, 4444 Forest Park Avenue, St Louis, Missouri 63108, USA. 60 Chinese Academy of Sciences, 52 Sanlihe Road, Beijing 100864, China. 61 Genome Canada, 150 Metcalfe Street, Suite 2100, Ottawa, Ontario K2P 1P1, Canada. 62 McGill University, Office of Technology Transfer, 3550 University Street, Montréal, Québec H3A 2A7, Canada. 63 Génome Québec, 630, boulevard René-Lévesque Ouest, Montréal, Québec H3B 1S6, Canada. 64 Ministry of Education, Culture, Sports, Science, and Technology, 3-2-2 Kasumigaseki, Chiyodaku, Tokyo 100-8959, Japan. 65 Ministry of Science and Technology of the People's Republic of China, 15 B. Fuxing Road, Beijing 100862, China. 66 The Human Genetic Resource Administration of China, b7, Zaojunmiao, Haidian District, Beijing 100081, China. 67 US National Institutes of Health, National Human Genome Research Institute, 5635 Fishers Lane, Bethesda, Maryland 20892, USA. 68 US National Institutes of Health, Office of Behavioral and Social Science Research, 31 Center Drive, Bethesda, Maryland 20892, USA. 69 Novartis Pharmaceuticals Corporation, Biomarker Development, One Health Plaza, East Hanover, New Jersey 07936, USA. 70 US National Institutes of Health, Office of Technology Transfer, 6011 Executive Boulevard, Rockville, Maryland 20852, USA.
71 University of Maryland School of Law, 500 West Baltimore Street, Baltimore, Maryland 21201, USA. 72 US National Institutes of Health, National Human Genome Research Institute, 31 Center Drive, Bethesda, Maryland 20892, USA.

METHODS

SNP selection and genotyping. All SNPs in dbSNP release 122 were considered for genotyping by Perlegen. Among these the following were excluded: SNPs for which no assay could be designed (primarily through location in repeat-rich regions; approximately 2.5 million); SNPs shown previously in samples from related populations 15 to be most probably in perfect association (r^2 = 1) with a Phase I SNP (approximately 122,000); all but one of SNPs shown previously 15 to be most probably in perfect association (r^2 = 1) with each other but not with a Phase I SNP (approximately 62,000); and SNPs shown previously 15 to have MAF < 0.05 (approximately 119,000). In addition, a few SNPs were excluded for efficiency (for example, if an amplicon contained a single SNP). Approximately 30,000 SNPs that had been typed in Phase I were deliberately retyped in Phase II to allow detailed comparisons of data quality, and an additional 15,000 SNPs that showed discrepancies between multiple genotyping attempts in Phase I were re-typed in Phase II. A further 2,000 SNPs identified by the Mammalian Gene Collection were also typed.

Perlegen performed genotyping using custom high-density oligonucleotide arrays as previously described 15. Initially, a pilot phase was carried out on chromosome 2p to optimize experimental workflow and data handling. Details of amplicons used in the experiment and PCR primers can be found at http://genome.perlegen.com/pcr/ and also on the HapMap website. The arrays were tiled with sets of 25-bp probes for each SNP, with either 40 or 24 probes per SNP. These consisted of four sets of features, corresponding to forward and reverse strand tilings of sequences complementary to each of the two SNP alleles.
Within a feature set, the position of the SNP within the oligonucleotide varied from position 11 to position 15. Mismatch probes were used to measure background, and by comparison with the perfect match probes, to detect the presence or absence of a specific PCR product. The 40-feature and 24-feature tilings both provided 10 perfect-match features for each SNP allele and differed only in the number of mismatch probes.

Genotypes were scored by clustering intensity measurements as previously described 15. In addition, quality scores similar to Phred scores were computed for each genotype call, based on a combination of experimental metrics correlated to data quality. Assays with overall call rates less than 80% or with poor average quality scores were flagged as failed. About 38% of the tiled assays failed these basic criteria, and the remainder were processed using the more rigorous HapMap Project data quality control filters. For analysis of the whole genome, probes for 4,373,926 distinct SNPs were tiled onto 32 chip designs, with 32 SNPs tiled in replicate onto each chip design for quality control (QC). Perlegen did not type the samples by plates as had been done for the Phase I genotyping, instead typing large numbers of SNPs one sample at a time. Consequently, blank wells on each plate were not included as a component of QC for this genotyping. In the Phase I HapMap a single JPT sample had been excluded because of technical problems. Perlegen typed a replacement sample (from the original JPT collection) for all new SNPs. This sample was not specifically genotyped on the Phase I SNPs, although a substantial fraction of these was typed in Phase II.

Additional genotype submissions came from the Affymetrix GeneChip Human Mapping 500K array called with the BRLMM algorithm. In release 21a additional genotype submissions were incorporated from the MHC haplotype consortium 11, the Illumina HumanHap300 BeadChip, the Illumina Human-1 Genotyping BeadChip and the 10K non-synonymous SNP set from Affymetrix (ParAllele).
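The probe counts in the tiling description above follow from simple arithmetic, which the short Python sketch below spells out. It assumes (the text does not say so explicitly) that every feature is either a perfect-match or a mismatch probe, and that each of the four feature sets contributes one perfect-match probe per SNP offset. The names STRANDS, ALLELES and SNP_OFFSETS are illustrative labels, not identifiers from the Perlegen pipeline.

```python
from itertools import product

# Illustrative labels only; the source names neither the strands nor the alleles.
STRANDS = ("forward", "reverse")
ALLELES = ("allele_A", "allele_B")
SNP_OFFSETS = range(11, 16)  # SNP sits at position 11..15 of the 25-bp probe


def perfect_match_features():
    """Enumerate one perfect-match probe per (strand, allele, offset)."""
    return list(product(STRANDS, ALLELES, SNP_OFFSETS))


def mismatch_features(total_features):
    """Mismatch probes implied by a tiling size, assuming all
    remaining features are mismatch probes (an assumption)."""
    return total_features - len(perfect_match_features())


pm = perfect_match_features()
pm_per_allele = {
    a: sum(1 for _, allele, _ in pm if allele == a) for a in ALLELES
}

print(len(pm))                                        # 20
print(pm_per_allele)                                  # {'allele_A': 10, 'allele_B': 10}
print(mismatch_features(40), mismatch_features(24))   # 20 4
```

Under these assumptions the 40-feature tiling would carry 20 mismatch probes and the 24-feature tiling 4, which is consistent with the statement that both tilings provide 10 perfect-match features per allele.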
Details of primer design, DNA amplification, DNA labelling and hybridization and signal detection for the Perlegen platform can be found in Supplementary Text 7.

QC analyses. Genotype submissions were assessed for mendelian errors (where possible), missing data rates and Hardy–Weinberg proportions. QC filters were applied as previously described 3; to achieve QC+ status a SNP had to have fewer than two mendelian errors, less than 20% missing data and P > 0.001 for Hardy–Weinberg analysis. The consensus data set consists only of SNPs for which QC+ submissions were available from all analysis panels. Where multiple submissions met the QC criteria the submission with the lowest missing data rate was chosen for inclusion in the non-redundant filtered data set. Comparison of the Phase II HapMap with the Affymetrix 500K genotypes has shown approximately 20 SNPs where the reported minor allele is discrepant (referred to as 'allele-flipping'). Over the entire data set, we expect that 500–2,000 SNPs have this problem and the vast majority will occur in SNPs from Phase I of the project. The Data Coordination Center (DCC) is working to resolve as many of these as possible.

Analyses of data quality. See Supplementary Text 2.

Analyses of population stratification, relatedness and homozygosity. See Supplementary Texts 3–6.

Analysis of recombination rate and gene ontology. We used the Panther Database 41 to obtain details of the gene molecular function and biological process. Genes are grouped into 28 top-level molecular function groups and 30 top-level biological process groups, with each gene allowed to exist in more than one group. We identified 14,979 non-overlapping autosomal genes from the Panther RefSeq Annotation for which we could obtain recombination rates. Of these, 9,735 had at least one assigned molecular function and 9,432 had at least one assigned biological process.
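The QC thresholds given under "QC analyses" above can be read as a simple per-submission filter followed by a tie-break on missing data. The Python sketch below is one hedged reading of those two rules for a single SNP in a single analysis panel; the Submission record, its field names and the example values are hypothetical, and the sketch does not model the additional requirement that passing submissions exist in every analysis panel.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Submission:
    """Hypothetical summary of one genotyping submission for one SNP."""
    snp_id: str
    centre: str
    mendelian_errors: int
    missing_rate: float   # fraction of missing genotypes, 0..1
    hwe_p: float          # Hardy-Weinberg test P value


def passes_qc(s: Submission) -> bool:
    # Thresholds as stated in the text: fewer than two mendelian errors,
    # less than 20% missing data, and Hardy-Weinberg P > 0.001.
    return s.mendelian_errors < 2 and s.missing_rate < 0.20 and s.hwe_p > 0.001


def pick_submission(subs: Iterable[Submission]) -> Optional[Submission]:
    """Among passing submissions, keep the one with the lowest missing rate."""
    passing = [s for s in subs if passes_qc(s)]
    return min(passing, key=lambda s: s.missing_rate) if passing else None


# Made-up example values, for illustration only.
subs = [
    Submission("rs_example", "centre_1", 0, 0.05, 0.40),
    Submission("rs_example", "centre_2", 0, 0.01, 0.0005),  # fails Hardy-Weinberg
    Submission("rs_example", "centre_3", 1, 0.02, 0.20),
]
best = pick_submission(subs)
print(best.centre)  # centre_3
```

Here centre_2 has the lowest missing rate but is excluded by the Hardy–Weinberg threshold, so the tie-break is applied only among the two remaining submissions.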
Genes without a molecular function or biological process were removed from the corresponding analysis. To control for gene size, we estimated the mean recombination rate over a 20-kb region centred on the mid-point of each gene transcription region. Genes were grouped based on molecular function and biological process. A mean recombination rate was calculated for each group. The significance of the result from each group was calculated via a permutation test involving 10^5 random groupings of genes. No correction was made for multiple testing. To account for the effect of G+C content on recombination, we performed a linear regression between the G+C content and recombination rate of all genes in each sample. Using the estimated regression parameters, the proportion of recombination explained by G+C content was subtracted from each gene.

Identification of non-synonymous SNPs and tests for natural selection. Using annotations from dbSNP release 125 we identified 17,427 polymorphic non-synonymous SNPs in release 21 and 15,976 polymorphic synonymous SNPs. Of these, 15,583 non-synonymous and 14,324 synonymous SNPs were autosomal and could have ancestral allele status unambiguously assigned by parsimony through comparison to the chimpanzee and macaque genomes. We used the phased haplotypes for analysis in which missing data had been imputed. F_ST was calculated using the method of Weir and Cockerham 59. To detect recent partial selective sweeps we used the long-range haplotype (LRH) test 44,49 and the integrated haplotype score (iHS) test 9. On simulated data 45, we found that the tests have similar power to detect recent selection but the iHS test has slightly lower power at low haplotype frequency and the LRH test has slightly lower power at high frequency. This can be seen in applications to HapMap Phase I data 3,9, where the iHS test misses the well-known cases of HBB and CD36 and the LRH test misses the SULT1C2 region.
Although both tests are based on the concept of EHH 44, we observed that the false positives produced by the two tests tend not to overlap and thus that signals detected by both tests have a very low false-positive rate.

doi:10.1038/nature06258 Nature ©2007 Publishing Group "
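The permutation test and G+C adjustment described under "Analysis of recombination rate and gene ontology" can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the consortium's code: the text does not say whether the test was one- or two-sided (two-sided is assumed here), the add-one correction to the P value is a common convention rather than something the paper states, and subtracting the fitted slope term while keeping the overall mean is only one reading of "the proportion of recombination explained by G+C content was subtracted". All function and variable names are hypothetical.

```python
import random
from statistics import mean


def gc_adjusted(rates, gc):
    """Remove the linear G+C trend from per-gene recombination rates.

    rates, gc: dicts keyed by gene. The fitted slope term is subtracted
    around the mean G+C content (an assumption about the method).
    """
    genes = list(rates)
    x = [gc[g] for g in genes]
    y = [rates[g] for g in genes]
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("G+C content has zero variance")
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return {g: rates[g] - slope * (gc[g] - mx) for g in genes}


def permutation_p_value(rates, group, n_perm=100_000, seed=0):
    """Two-sided permutation P value for a gene group's mean rate.

    Each permutation draws a random group of the same size from all
    genes; 10^5 random groupings is the number quoted in the text.
    """
    rng = random.Random(seed)
    values = list(rates.values())
    overall = mean(values)
    observed = abs(mean(rates[g] for g in group) - overall)
    k = len(group)
    hits = 0
    for _ in range(n_perm):
        if abs(mean(rng.sample(values, k)) - overall) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data: 100 hypothetical genes with rates 0..99.
toy_rates = {f"g{i}": float(i) for i in range(100)}
extreme_group = {f"g{i}" for i in range(95, 100)}
print(permutation_p_value(toy_rates, extreme_group, n_perm=2000))
```

Because no correction was made for multiple testing in the original analysis, P values from a sketch like this would need the same caution when applied across the 28 molecular function and 30 biological process groups.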