
Personalized copy number and segmental duplication maps using next-generation sequencing

Nature Genetics volume 41, pages 1061–1067 (2009)


Despite their importance in gene innovation and phenotypic variation, duplicated regions have remained largely intractable owing to difficulties in accurately resolving their structure, copy number and sequence content. We present an algorithm (mrFAST) to comprehensively map next-generation sequence reads, which allows for the prediction of absolute copy-number variation of duplicated segments and genes. We examine three human genomes and experimentally validate genome-wide copy number differences. We estimate that, on average, 73–87 genes vary in copy number between any two individuals and find that these genic differences overwhelmingly correspond to segmental duplications (odds ratio = 135; P < 2.2 × 10⁻¹⁶). Our method can distinguish between different copies of highly identical genes, providing a more accurate assessment of gene content and insight into functional constraint without the limitations of array-based technology.
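The idea of predicting absolute copy number from comprehensively mapped reads can be illustrated with a minimal sketch. It assumes that copy number is inferred from read depth in genomic windows, normalized against control windows taken to be diploid. That normalization scheme, the function name `estimate_copy_number`, and all depth values below are illustrative assumptions. This is not the mrFAST implementation or the authors' pipeline.

```python
from statistics import mean


def estimate_copy_number(window_depths, control_depths, control_copy=2):
    """Estimate absolute copy number per window from read depth.

    Assumption (not from the article): depth scales linearly with copy
    number, and the control windows are present at `control_copy` copies.
    """
    if not control_depths:
        raise ValueError("need at least one control window")
    baseline = mean(control_depths)
    if baseline <= 0:
        raise ValueError("control depth must be positive")
    # Expected depth contributed by a single copy of a region.
    per_copy = baseline / control_copy
    return [depth / per_copy for depth in window_depths]


# Hypothetical read depths, chosen only to make the arithmetic easy to follow.
control = [30.0, 28.0, 32.0, 30.0]   # assumed diploid windows
windows = [15.0, 30.0, 60.0, 0.0]    # windows of unknown copy number

estimates = estimate_copy_number(windows, control)
rounded = [round(cn) for cn in estimates]
print(rounded)
```

A real analysis would also need to handle effects such as sequence-composition bias and repeat masking, which this sketch deliberately ignores.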



We thank D. Bentley for early access to the Illumina WGS dataset for NA18507; J. Wang for the YH DNA and the cell line; M. Egholm and B. Simen for the JDW DNA; and J.D. Watson for permission to analyze his genome. We also thank M. Shumway, P. Flicek and R. Leinonen for technical assistance in transferring large sequence datasets; E. Tüzün for help in parallelizing mrFAST for computation clusters through the Message Passing Interface; S. Girirajan for assistance with experiments; and T. Brown for her help in manuscript preparation. J.M.K. is supported by a US National Science Foundation Graduate Research Fellowship. T.M.-B. is supported by a Marie Curie fellowship (FP7). This work was supported, in part, by US National Institutes of Health grant HG004120 to E.E.E. E.E.E. is an investigator of the Howard Hughes Medical Institute.

Author information


  1. Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington, USA.

    • Can Alkan
    • Jeffrey M Kidd
    • Tomas Marques-Bonet
    • Gozde Aksay
    • Francesca Antonacci
    • Jacob O Kitzman
    • Carl Baker
    • Maika Malig
    • Evan E Eichler
  2. Howard Hughes Medical Institute, Seattle, Washington, USA.

    • Can Alkan
    • Evan E Eichler
  3. Institut de Biologia Evolutiva (UPF-CSIC), Barcelona, Catalonia, Spain.

    • Tomas Marques-Bonet
  4. School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada.

    • Fereydoun Hormozdiari
    • S Cenk Sahinalp
  5. Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.

    • Onur Mutlu
  6. Baylor College of Medicine, Houston, Texas, USA.

    • Richard A Gibbs



C.A., J.M.K., T.M.-B. and E.E.E. designed the study, performed analytical work and wrote the manuscript. C.A., F.H. and O.M. designed and implemented the mrFAST algorithm. C.A., J.M.K., G.A. and J.O.K. performed computational analysis. T.M.-B., F.A., C.B. and M.M. performed validation experiments. R.A.G. advised on handling of JDW data analysis. S.C.S. and E.E.E. obtained funding for the study.

Corresponding author

Correspondence to Evan E Eichler.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Note, Supplementary Figures 1–7, and Supplementary Tables 1–3 and 6

Excel files

  1. Supplementary Table 4

    Estimated diploid copy number for 17,601 autosomal coding genes

  2. Supplementary Table 5

    Individual exons that are estimated to be copy-number variable among the three analyzed individuals
