  • Letter

Genome assembly comparison identifies structural variants in the human genome


Numerous types of DNA variation exist, ranging from SNPs to larger structural alterations such as copy number variants (CNVs) and inversions. Alignment of DNA sequence from different sources has been used to identify SNPs (refs. 1,2) and intermediate-sized variants (ISVs; ref. 3). However, only a small proportion of total heterogeneity is characterized, and little is known of the characteristics of most smaller-sized (<50 kb) variants. Here we show that genome assembly comparison is a robust approach for identification of all classes of genetic variation. Through comparison of two human assemblies (Celera's R27c compilation and the Build 35 reference sequence), we identified megabases of sequence (in the form of 13,534 putative non-SNP events) that were absent, inverted or polymorphic in one assembly. Database comparison and laboratory experimentation further demonstrated overlap or validation for 240 variable regions and confirmed >1.5 million SNPs. Some differences were simple insertions and deletions, but in regions containing CNVs, segmental duplication and repetitive DNA, they were more complex. Our results uncover substantial undescribed variation in humans, highlighting the need for comprehensive annotation strategies to fully interpret genome scanning and personalized sequencing projects.
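As a rough illustration of the general idea of assembly comparison, the sketch below takes alignment blocks from one query scaffold against a reference and reports the query intervals with no alignment (candidate absent sequence) and the blocks aligned in reverse orientation (candidate inversions). It is a toy example under stated assumptions, not the GCA algorithm used in this study: the `Block` type, the function names and the coordinates are all hypothetical, and a real pipeline would also need to handle repeats, duplications and alignment quality.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """One alignment block on a query scaffold (hypothetical structure)."""
    q_start: int  # 0-based start on the query scaffold
    q_end: int    # exclusive end on the query scaffold
    strand: str   # '+' same orientation as the reference, '-' reversed


def unmatched_intervals(blocks: List[Block], q_len: int) -> List[Tuple[int, int]]:
    """Return query intervals not covered by any alignment block."""
    gaps = []
    cursor = 0  # end of the region covered so far
    for b in sorted(blocks, key=lambda blk: blk.q_start):
        if b.q_start > cursor:
            gaps.append((cursor, b.q_start))
        cursor = max(cursor, b.q_end)
    if cursor < q_len:
        gaps.append((cursor, q_len))
    return gaps


def inverted_blocks(blocks: List[Block]) -> List[Block]:
    """Return blocks that align in reverse orientation to the reference."""
    return [b for b in blocks if b.strand == '-']


# Made-up coordinates for a 500-base query scaffold.
blocks = [Block(0, 100, '+'), Block(150, 300, '-'), Block(280, 400, '+')]
print(unmatched_intervals(blocks, 500))  # [(100, 150), (400, 500)]
print(inverted_blocks(blocks))
```

Overlapping blocks are merged implicitly by tracking the furthest covered position, so the overlap between the second and third blocks does not produce a spurious gap.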

Figure 1: Overview of the different types of alignments and assembly differences extracted from the R27c and Build 35 genome assemblies.
Figure 2: Genome-wide overview of insertion points of unmatched and copy-unmatched sequences present in R27c with no corresponding match to Build 35.
Figure 3: Fosmid probes were used for FISH experiments to confirm the R27c mapping of unmatched sequences to Build 35 or to find a location for sequences with inconsistent or no mapping information.

References


  1. Marth, G.T. et al. A general approach to single-nucleotide polymorphism discovery. Nat. Genet. 23, 452–456 (1999).
  2. Tsui, C. et al. Single nucleotide polymorphisms (SNPs) that map to gaps in the human SNP map. Nucleic Acids Res. 31, 4910–4916 (2003).
  3. Tuzun, E. et al. Fine-scale structural variation of the human genome. Nat. Genet. 37, 727–732 (2005).
  4. Lander, E.S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
  5. Venter, J.C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
  6. Myers, E.W., Sutton, G.G., Smith, H.O., Adams, M.D. & Venter, J.C. On the sequencing and assembly of the human genome. Proc. Natl. Acad. Sci. USA 99, 4145–4146 (2002).
  7. Adams, M.D., Sutton, G.G., Smith, H.O., Myers, E.W. & Venter, J.C. The independence of our genome assemblies. Proc. Natl. Acad. Sci. USA 100, 3025–3026 (2003).
  8. Istrail, S. et al. Whole-genome shotgun assembly and comparison of human genome assemblies. Proc. Natl. Acad. Sci. USA 101, 1916–1921 (2004).
  9. Waterston, R.H., Lander, E.S. & Sulston, J.E. On the sequencing of the human genome. Proc. Natl. Acad. Sci. USA 99, 3712–3716 (2002).
  10. Waterston, R.H., Lander, E.S. & Sulston, J.E. More on the sequencing of the human genome. Proc. Natl. Acad. Sci. USA 100, 3022–3024 (2003).
  11. Feuk, L., Carson, A.R. & Scherer, S.W. Structural variation in the human genome. Nat. Rev. Genet. 7, 85–97 (2006).
  12. Zhang, Z., Schwartz, S., Wagner, L. & Miller, W. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7, 203–214 (2000).
  13. Mobarry, C. & Sutton, G. An assembly-to-assembly comparison tool. in Proceedings of the Third Annual RECOMB Satellite Meeting on DNA Sequencing Technologies and Computation (2003).
  14. Kent, W.J. BLAT–the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).
  15. Pruitt, K.D., Tatusova, T. & Maglott, D.R. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 33, D501–D504 (2005).
  16. Bailey, J.A. et al. Recent segmental duplications in the human genome. Science 297, 1003–1007 (2002).
  17. Iafrate, A.J. et al. Detection of large-scale variation in the human genome. Nat. Genet. 36, 949–951 (2004).
  18. Redon, R. et al. Global variation in copy number in the human genome. Nature (in the press).
  19. Wang, J. et al. dbRIP: a highly integrated database of retrotransposon insertion polymorphisms in humans. Hum. Mutat. 27, 323–329 (2006).
  20. Hillier, L.W. et al. The DNA sequence of human chromosome 7. Nature 424, 157–164 (2003).
  21. Scherer, S.W. et al. Human chromosome 7: DNA sequence and biology. Science 300, 767–772 (2003).
  22. Schmutz, J. et al. The DNA sequence and comparative analysis of human chromosome 5. Nature 431, 268–274 (2004).
  23. Shendure, J., Mitra, R.D., Varma, C. & Church, G.M. Advanced sequencing technologies: methods and goals. Nat. Rev. Genet. 5, 335–344 (2004).
  24. Bennett, S.T., Barnes, C., Cox, A., Davies, L. & Brown, C. Toward the 1,000 dollars human genome. Pharmacogenomics 6, 373–382 (2005).
  25. Service, R.F. Gene sequencing. The race for the $1000 genome. Science 311, 1544–1546 (2006).
  26. Cheung, J. et al. Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence. Genome Biol. 4, R25 (2003).
  27. Kent, W.J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).
  28. Feuk, L. et al. Discovery of human inversion polymorphisms by comparative analysis of human and chimpanzee DNA sequence assemblies. PLoS Genet. 1, e56 (2005).
  29. Pfaffl, M.W. A new mathematical model for relative quantification in real-time RT-PCR. Nucleic Acids Res. 29, e45 (2001).
  30. Osborne, L.R. et al. A 1.5 million-base pair inversion polymorphism in families with Williams-Beuren syndrome. Nat. Genet. 29, 321–325 (2001).


We thank T. Tang, L. Wong, J. Wittnam, C.-F. Chu and W. Hwang of The Centre for Applied Genomics for technical assistance. Computational analyses were supported by the Shared Hierarchical Academic Research Computing Network (SHARCNET) and the Centre for Computational Biology at the Hospital for Sick Children. The work was supported by Genome Canada/Ontario Genomics Institute, the Canadian Institutes of Health Research (CIHR), the Canada Foundation for Innovation and the McLaughlin Centre for Molecular Medicine (all to S.W.S.). L.A. and X.E. are supported by Genoma España and Genome Canada joint R+D+I projects and by the Generalitat de Catalunya (Departament d'Universitats, 2005SGR00008, and Departament de Salut). L.F. is supported by CIHR. S.W.S. is an Investigator of CIHR and an International Scholar of the Howard Hughes Medical Institute.

Author information


The study was designed by R.K., S.W.S. and L.F. The GCA algorithm was created by R.K. Sequence alignment and computational analyses were performed by R.K., J.Z., J.R.M., J.W., C.Q., L.A. and R.J.M. FISH analysis was performed by Y.H., A.M.J.G., M.S. and C.L. PCR analysis was performed by M.A.R., L.P., L.A. and L.F. J.Z., J.R.M., J.W., C.Q., H.A., K.J., R.R., M.H., L.A., X.E., C.L., S.W.S. and L.F. contributed to the analysis of overlap with genomic features, creation of data sets for such analysis and interpretation of the data. S.W.S. and L.F. conceptualized, designed and coordinated the experiments. The paper was written by S.W.S. and L.F.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Table 1

Results for MegaBLAST and A2Amapper comparing R27c versus Build 35 and comparing Build 35 versus R27c. (PDF 18 kb)

Supplementary Table 2

List of copy-unmatched sequences identified by GCA; table also shows information on repeat content and re-BLAT versus Build 35, Build 36 and chimpanzee Build 1. (XLS 176 kb)

Supplementary Table 3

Intra- and interscaffold inversions identified by GCA between R27c and Build 35. (PDF 10 kb)

Supplementary Table 4

List of refined set of unmatched sequences used for analysis of overlap with genomic features; all entries in this list with an insertion point were used for genomic overlap analysis. (XLS 5127 kb)

Supplementary Table 5

Analysis of RefSeq genes and mRNAs. (XLS 377 kb)

Supplementary Table 6

Results and details for PCR-based assays. (XLS 34 kb)

Supplementary Table 7

Results and details for fluorescence in situ hybridization experiments. (PDF 37 kb)

Supplementary Table 8

Results of comparisons of single-base mismatches detected by GCA with dbSNP_125 and with HapMap QC+/QC− SNPs. (PDF 59 kb)

Supplementary Table 9

Comparison between assembly differences and other genomic features. (XLS 82 kb)

Supplementary Methods (PDF 105 kb)

About this article

Cite this article

Khaja, R., Zhang, J., MacDonald, J. et al. Genome assembly comparison identifies structural variants in the human genome. Nat Genet 38, 1413–1418 (2006).
