The human genome sequence will provide a reference for measuring DNA sequence variation in human populations. Sequence variants are responsible for the genetic component of individuality, including complex characteristics such as disease susceptibility and drug response. Most sequence variants are single nucleotide polymorphisms (SNPs), where two alternate bases occur at one position1,2,3. Comparison of any two genomes reveals around 1 SNP per kilobase1,3. A sufficiently dense map of SNPs would allow the detection of sequence variants responsible for particular characteristics on the basis that they are associated with a specific SNP allele4,5,6. Here we have evaluated large-scale sequencing approaches to obtaining SNPs, and have constructed a map of 2,730 SNPs on human chromosome 22. Most of the SNPs are within 25 kilobases of a transcribed exon, and are valuable for association studies. We have scaled up the process, detecting over 65,000 SNPs in the genome as part of The SNP Consortium programme, which is on target to build a map of 1 SNP every 5 kilobases that is integrated with the human genome sequence and that is freely available in the public domain.
Single nucleotide polymorphisms (SNPs) are stable, bi-allelic sequence variants that are distributed throughout the genome, which can be assayed using high-throughput automated methods. Sequence variants have been detected previously by analysis of sequence differences in clusters of expressed sequence tags7,8; or by re-sequencing DNA fragments after amplification from different individuals9,10,11, sometimes following prescreening12,13,14. These approaches are effective for exploring sequence variation of individual genes in depth. An alternative approach, which takes advantage of the reagents from the human genome project, is to detect SNPs in regions of overlap between bacterial clones containing sequences of independent genomes (ref. 2; and E.D. et al., unpublished data). This analysis provides a valuable resource of SNPs in short sections throughout the genome, but each section is interspersed by large regions (typically 0.1–0.5 megabases (Mb)) where the sequence of only one genome is available, and where no polymorphisms can be detected.
We evaluated two large-scale sequencing strategies to identify SNPs to cover the entire human genome at a target density of 1 SNP per 5 kilobases (kb). In the reduced representation shotgun (RRS) strategy15, specific subsets of restriction fragments, made from an equimolar mixture of DNA isolated from unrelated individuals, are repeatedly sampled by sequencing. Reads derived from the same fragment in different genotypes are aligned into clusters, or cliques, and high-confidence sequence differences between any two reads (that is, candidate SNPs) are recorded. Experimental verification of a subset of these candidate SNPs (by re-sequencing to identify the SNP in individual samples of the original DNA panel) tests the criteria used in the computational detection of candidate SNPs. In the genomic-alignment strategy, single reads obtained by shotgun sequencing of a library of DNA fragments are aligned directly to available genomic sequence to detect the candidate SNPs. The two strategies are complementary: the RRS strategy allows detection of SNPs throughout the genome without genomic sequence, but the SNPs are not automatically mapped; whereas the genomic-alignment strategy requires genomic sequence, and provides a map location for each SNP. Both strategies are applicable to any genome.
The availability of the complete sequence of chromosome 22 (ref. 16) enabled us to compare both approaches directly. The efficiency of SNP detection for each approach was measured as the number of SNPs detected relative to the amount of new raw sequence data generated for the analysis (see Table 1). For RRS, chromosome 22 was flow-sorted from seven unrelated individuals (see Methods), and clones from a library of 1.2–2.1-kb fragments generated by HindIII digestion were sequenced. After removing highly repetitive or poor quality sequences, 4,584 reads were aligned into 1,842 clusters (1,013 with more than one read, plus 829 singletons), corresponding to a mean cluster depth (number of reads in clusters / number of clusters) of 2.49. Using a specific set of criteria (see Methods), we detected 455 candidate SNPs in the 1,013 clusters of two or more reads. Experimental verification of a subset of these candidates by re-sequencing confirmed that 95% (74/78) were true SNPs. The remaining four candidate SNPs were homozygous in all DNA samples tested, and are presumed to be sequencing errors. The efficiency of SNP detection was 1 SNP per 10.1 reads (1 SNP per 4.79 kb of raw data; see Table 1). In a separate analysis we detected an additional 13% variants comprising insertion/deletion polymorphisms. Two-thirds of these were variations in poly(A) tracts, and the remainder (4% of all variants) are potentially a valuable additional source of polymorphisms for genetic studies.
In contrast to the RRS strategy, in which the library must be sequenced to a sufficient depth to obtain clusters of multiple reads for SNP detection, genomic-alignment analysis minimally requires alignment of just one read against finished genomic sequence. The genomic-alignment strategy should therefore result in a higher efficiency of SNP detection. Like RRS, the genomic-alignment strategy should also yield high-confidence SNPs, as all human genomic sequence is finished to an accuracy of more than 99.99%, (and therefore has a quality value Q of 40; ref. 17). To test this hypothesis, we aligned the RRS reads obtained from the previous experiment (including all singletons) to the finished chromosome 22 sequence (33.4 Mb). We identified 914 candidate SNPs, and included all the SNPs found by the RRS analysis. The success rate of verification was 94% (115/122). The strategy therefore results in a twofold improvement in the efficiency of SNP detection (1 SNP per 5.0 reads compared with 1 SNP per 10.1 reads; see Table 1).
The genomic-alignment approach should be most efficient if each new sequence read aligns to a separate section of the genome. We therefore constructed a library of randomly sheared fragments from flow-sorted chromosome 22 DNA, and aligned 5,567 high-quality reads to the finished genomic sequence. We detected 1,845 candidate SNPs, and the verification success rate was 97% (33/34). The SNP detection rate in this analysis is 1 SNP per 1,391 bases of raw sequence data of Q ≥ 23 (or 1 SNP per 3.0 reads). Genomic-alignment analysis of the random shotgun sequences (compared with genomic-alignment analysis of the RRS sequences) thus provided a further 1.6-fold improvement in the efficiency of SNP detection.
A total of 2,730 different SNPs (that is, 1 per 12 kb of chromosome 22 sequence) were identified during these studies. The position of each SNP was determined by aligning its flanking sequence to the sequence of chromosome 22. The plots in Fig. 1 illustrate the distribution of SNPs detected by each method. Chromosome 22q contains at least 545 transcribed genes (3,632 exons) on the basis of the reported annotation. 1,043 of the SNPs (38%; magenta bars in Fig. 1) detected in this study lie either inside or within 5 kb of a transcribed exon in the current annotated set. In all, 1,771 (65 %) of the SNPs are within 25 kb of an exon, and may be informative in association studies, depending on the extent of linkage disequilibrium in genomic regions5,18,19. From the current set of annotated transcribed exons, 37% (1,333) of them have at least one SNP within 5 kb, and 84% (3,039) have at least one SNP from the present set within 25 kb. This study therefore already provides SNPs that are sufficiently close to most of the transcribed regions to be in linkage disequilibrium with possible functional variants, and this coverage will improve when a more dense SNP map is produced for the whole genome.
On the basis of the results of this study, and studies at the Whitehead Institute and the Genome Sequencing Center, St Louis, The SNP Consortium (TSC: a consortium of academic and pharmaceutical companies, see http://snp.cshl.org) initiated a programme with the goal of generating a freely available public resource of 300,000 SNPs, of which at least 150,000 would be mapped by April 2001. As part of this work, we have extended our SNP identification programme by application of the RRS strategy to a whole-genome library of PvuII fragments of 0.925–1.250 kb, constructed using DNA isolated from 24 unrelated individuals (the same panel used by all TSC participants; see Methods). Up to April 2000, sequencing to a cluster depth of 2.51 has yielded 52,354 candidate SNPs. This corresponds to a detection rate of 1 SNP per 9.0 reads. The verification success rate for these SNPs was 97% (122/128). Candidate SNPs in three additional loci were heterozygous in all DNA samples tested, and these loci are presumed to be low-copy repeats. The success rate of SNP verification concurs with the results of the pilot study, and confirms the feasibility of the RRS approach to detect SNPs in the whole genome. Furthermore, alignment of the PvuII RRS data with the 500 Mb finished genomic sequence that is available resulted in identification of a total 19,192 SNPs (including 9,727 not detected by RRS). On the basis that roughly 2,700 Mb of the genome sequence becomes available, the number of SNPs found by genomic-alignment analysis of this set of RRS reads would be twofold higher than by RRS analysis alone, in agreement with the results of the chromosome 22 study.
Production of the working draft sequence (comprising mapped bacterial clones each sequenced to a depth of ≥ threefold in bases of Q ≥ 20; ref. 20) of the human genome is well advanced (http://www.nhgri.nih.gov/HGP). Genomic-alignment analysis of RRS or random shotgun sequences using the draft would provide a high yield of candidate SNPs. The sequence reads are assembled using the program PHRAP (P. Green, personal communication), which provides a combined quality score for each base of the consensus sequence. This PHRAP quality score can be used for identification of candidate SNPs by genomic-alignment analysis using the working draft, in the same way as using finished sequence, by requiring a minimum PHRAP quality score (Q) of 40. After aligning the sequence data from the PvuII RRS library against a sample of unfinished sequence with variable quality scores, a set of candidate SNPs was selected using a PHRAP Q value of ≥ 40. The verification success rate was 99% (244/246). This confirmed the validity of using genomic-alignment analysis with the working draft sequence of the genome to identify (and map) candidate SNPs.
The initial goal of the TSC was to identify 300,000 SNPs (1 per 10 kb) by April 2001. Our work shows that the goal can be achieved using the RRS strategy and the resources currently available in the consortium. To our benefit, the acceleration in the human genome sequencing programme over the past 12 months will provide the opportunity to perform genomic-alignment analysis on all the sequence data, thus providing a substantial improvement in the yield of SNPs (at least twofold on current projections, or 600,000 SNPs) from the TSC programme. Furthermore, the majority of SNPs will be mapped directly by alignment to the genomic sequence, thus providing an SNP map (1 SNP per 5 kb on average) of the human genome. This freely available, public resource will underpin extensive genetic studies to establish the extent of linkage disequilibrium in the human genome, and to investigate the association of genes with complex disease and other phenotypic traits.
Sequencing and alignment
Chromosome 22 was flow-sorted from individual lymphoblastoid cell lines derived from seven individuals of north European origin, selected from the Porton Down collection of unrelated individuals collected as controls for disease association studies (Panel HRC, nos 159, 146, 184, 163, 160, 226, 193, 575 and 148). Chromosome preparations were pooled and digested with HindIII, and a gel-purified size fraction representing 1.2–2.1 kb was subcloned into the plasmid vector pUC18. We picked clones from a library of more than 100,000 transformants and sequenced them using forward and reverse primers as described21. Data were collected on ABI377 or 3700 machines (PE Biosystems), and analysed using the base-calling algorithm PHRED22,23. All sequence reads that contained more than 100 bases with a PHRED quality score ( Q) ≥ 30 were selected for masking of repeats using RepeatMasker (version 21/03/99: A. Smit and P. Green, http://ftp.genome.washington.edu/RM/RepeatMasker.html). Reads with more than 80 bases of unique sequence were then assembled using a method based on Cross_Match followed by multiple sequence alignment of the reads matching unique genomic loci on chromosome 22.
The PvuII RRS libraries were constructed using genomic DNA from 24 individuals containing one or more representatives of a range of ethnic groups (the DNA Polymorphism Discovery Resource M24PDR, Coriell Cell Repositories24), by agarose gel fractionation of a complete digest of a mixture of the DNAs, followed by subcloning of specific size fractions in pUC18. For assembly of forward and reverse sequence reads, PHRAP (P. Green, University of Washington) was used to assemble the RRS PvuII library data (see Table 1).
Criteria for selection of candidate SNPs were as follows (see also ref. 15): (1) the quality value (Q) of the SNP base (a value recorded for each base in a sequence read by the base-calling program PHRED22,23) was ≥ 23; and the Q value for the 5 bases on either side of the SNP was ≥ 15. (2) At least nine of the flanking ten bases matched between reads. All ten had to match for the Cross_Match (P. Green, University of Washington) based method. (3) The cluster depth was no greater than eight reads, on the basis that deeper clusters might comprise a low-copy repeat. (4) The number of candidate SNPs in a cluster was ≤ 4, on the basis that clusters with more divergent sequences might be composed of low-copy repeats (that is, recently diverged paralogous sequences, accumulating sequence differences between them). The minimum Q for the SNP base was chosen as follows: visual examination of the trace data for a randomly selected set of candidate SNPs called at Q ≥ 20 revealed 15 out of 111 (13%) obvious sequence artefacts, 9 of which had Q values in the range from 20 to 22. When the SNP identification was rerun at Q ≥ 23, a new randomly selected set of candidate SNPs was examined, from which only 5 out of 117 (4.3%) failures were observed. This failure rate was confirmed by the experimental verification data.
Classification of single-base substitutions
We classified single-base substitutions on the basis of transitions or transversions as follows. C to T (or G to A) transitions: 70.1% of all SNPs were possible CpG to TpG mutations (the most frequently observed single-base substitution, which is presumed to arise following deamination of 5-methylcytosine in Me-CpG; ref. 25); 29.1% of all SNPs were transversions of C/A, C/G and T/A (15.7%, 8.6% and 5.6% of all SNPs, respectively).
Candidate SNPs from each dataset were selected at random for experimental verification. Primer pairs were designed using PRIMER (http://www.genome.wi.mit.edu/genome_software/genome_software_index.html) and loci were amplified from the DNA samples of the individual cell lines used for the initial library construction. Polymerase chain reaction products were purified by treatment with shrimp alkaline phosphatase and exonuclease 1, sequenced from both ends, and the data for each SNP assembled and examined in a GAP4 database.
All data including sequence reads, candidate SNPs and verification information were submitted to the Data Coordination Centre (DCC) of The SNP Consortium (http://snp.cshl.org). Each SNP is assigned a unique identifier (for example, TSC0137673) and released in the public domain when given a map position either by alignment to mapped genomic sequence, or by whole-genome radiation hybrid mapping. SNP information is available at the above web site (from the home page, click on ‘data’, then ‘object search’, and enter ‘TSC0137673’, and search); and also in dbSNP (http://www.ncbi.nlm.nih.gov/SNP).
We are grateful for the contribution of all the members of the sequencing and supporting teams, and members of the informatics group of the Sanger Centre (see http://www.sanger.ac.uk). We thank the Wellcome Trust and The SNP Consortium for financial support, and members of the participating centres for helpful discussions.