Numerous types of DNA variation exist, ranging from SNPs to larger structural alterations such as copy number variants (CNVs) and inversions. Alignment of DNA sequence from different sources has been used to identify SNPs (refs. 1,2) and intermediate-sized variants (ISVs; ref. 3). However, only a small proportion of total heterogeneity is characterized, and little is known of the characteristics of most smaller-sized (<50 kb) variants. Here we show that genome assembly comparison is a robust approach for identification of all classes of genetic variation. Through comparison of two human assemblies (Celera's R27c compilation and the Build 35 reference sequence), we identified megabases of sequence (in the form of 13,534 putative non-SNP events) that were absent, inverted or polymorphic in one assembly. Database comparison and laboratory experimentation further demonstrated overlap or validation for 240 variable regions and confirmed >1.5 million SNPs. Some differences were simple insertions and deletions, but in regions containing CNVs, segmental duplication and repetitive DNA, they were more complex. Our results uncover substantial undescribed variation in humans, highlighting the need for comprehensive annotation strategies to fully interpret genome scanning and personalized sequencing projects.
Marth, G.T. et al. A general approach to single-nucleotide polymorphism discovery. Nat. Genet. 23, 452–456 (1999).
Tsui, C. et al. Single nucleotide polymorphisms (SNPs) that map to gaps in the human SNP map. Nucleic Acids Res. 31, 4910–4916 (2003).
Tuzun, E. et al. Fine-scale structural variation of the human genome. Nat. Genet. 37, 727–732 (2005).
Lander, E.S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
Venter, J.C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
Myers, E.W., Sutton, G.G., Smith, H.O., Adams, M.D. & Venter, J.C. On the sequencing and assembly of the human genome. Proc. Natl. Acad. Sci. USA 99, 4145–4146 (2002).
Adams, M.D., Sutton, G.G., Smith, H.O., Myers, E.W. & Venter, J.C. The independence of our genome assemblies. Proc. Natl. Acad. Sci. USA 100, 3025–3026 (2003).
Istrail, S. et al. Whole-genome shotgun assembly and comparison of human genome assemblies. Proc. Natl. Acad. Sci. USA 101, 1916–1921 (2004).
Waterston, R.H., Lander, E.S. & Sulston, J.E. On the sequencing of the human genome. Proc. Natl. Acad. Sci. USA 99, 3712–3716 (2002).
Waterston, R.H., Lander, E.S. & Sulston, J.E. More on the sequencing of the human genome. Proc. Natl. Acad. Sci. USA 100, 3022–3024 (2003).
Feuk, L., Carson, A.R. & Scherer, S.W. Structural variation in the human genome. Nat. Rev. Genet. 7, 85–97 (2006).
Zhang, Z., Schwartz, S., Wagner, L. & Miller, W. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7, 203–214 (2000).
Mobarry, C. & Sutton, G. An assembly-to-assembly comparison tool. in Proceedings of the Third Annual RECOMB Satellite Meeting on DNA Sequencing Technologies and Computation (2003).
Kent, W.J. BLAT–the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).
Pruitt, K.D., Tatusova, T. & Maglott, D.R. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 33, D501–D504 (2005).
Bailey, J.A. et al. Recent segmental duplications in the human genome. Science 297, 1003–1007 (2002).
Iafrate, A.J. et al. Detection of large-scale variation in the human genome. Nat. Genet. 36, 949–951 (2004).
Redon, R. et al. Global variation in copy number in the human genome. Nature (in the press).
Wang, J. et al. dbRIP: a highly integrated database of retrotransposon insertion polymorphisms in humans. Hum. Mutat. 27, 323–329 (2006).
Hillier, L.W. et al. The DNA sequence of human chromosome 7. Nature 424, 157–164 (2003).
Scherer, S.W. et al. Human chromosome 7: DNA sequence and biology. Science 300, 767–772 (2003).
Schmutz, J. et al. The DNA sequence and comparative analysis of human chromosome 5. Nature 431, 268–274 (2004).
Shendure, J., Mitra, R.D., Varma, C. & Church, G.M. Advanced sequencing technologies: methods and goals. Nat. Rev. Genet. 5, 335–344 (2004).
Bennett, S.T., Barnes, C., Cox, A., Davies, L. & Brown, C. Toward the 1,000 dollars human genome. Pharmacogenomics 6, 373–382 (2005).
Service, R.F. Gene sequencing. The race for the $1000 genome. Science 311, 1544–1546 (2006).
Cheung, J. et al. Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence. Genome Biol. 4, R25 (2003).
Kent, W.J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).
Feuk, L. et al. Discovery of human inversion polymorphisms by comparative analysis of human and chimpanzee DNA sequence assemblies. PLoS Genet. 1, e56 (2005).
Pfaffl, M.W. A new mathematical model for relative quantification in real-time RT-PCR. Nucleic Acids Res. 29, e45 (2001).
Osborne, L.R. et al. A 1.5 million-base pair inversion polymorphism in families with Williams-Beuren syndrome. Nat. Genet. 29, 321–325 (2001).
We thank T. Tang, L. Wong, J. Wittnam, C.-F. Chu and W. Hwang of The Centre for Applied Genomics for technical assistance. Computational analyses were supported by the Shared Hierarchical Academic Research Computing Network (SHARCNET) and the Centre for Computational Biology at the Hospital for Sick Children. The work was supported by Genome Canada/Ontario Genomics Institute, the Canadian Institutes of Health Research (CIHR), the Canada Foundation for Innovation and the McLaughlin Centre for Molecular Medicine (all to S.W.S.). L.A. and X.E. are supported by Genoma España and Genome Canada joint R+D+I projects and by the Generalitat de Catalunya (Departament d'Universitats, 2005SGR00008, and Departament de Salut). L.F. is supported by CIHR. S.W.S. is an Investigator of CIHR and an International Scholar of the Howard Hughes Medical Institute.
The authors declare no competing financial interests.
Results for MegaBLAST and A2Amapper comparing R27c versus Build 35 and comparing Build 35 versus R27c. (PDF 18 kb)
List of copy-unmatched sequences identified by GCA; table also shows information on repeat content and re-BLAT versus Build 35, Build 36 and chimpanzee Build 1. (XLS 176 kb)
Intra- and interscaffold inversions identified by GCA between R27c and Build 35. (PDF 10 kb)
List of refined set of unmatched sequences used for analysis of overlap with genomic features; all entries in this list with an insertion point were used for genomic overlap analysis. (XLS 5127 kb)
Analysis of RefSeq genes and mRNAs. (XLS 377 kb)
Results and details for PCR-based assays. (XLS 34 kb)
Results and details for fluorescence in situ hybridization experiments. (PDF 37 kb)
Results of comparisons of single-base mismatches detected by GCA with dbSNP_125 and with HapMap QC+/QC− SNPs. (PDF 59 kb)
Comparison between assembly differences and other genomic features. (XLS 82 kb)
Cite this article
Khaja, R., Zhang, J., MacDonald, J. et al. Genome assembly comparison identifies structural variants in the human genome. Nat Genet 38, 1413–1418 (2006). https://doi.org/10.1038/ng1921