  • Resource
Characterization of missing human genome sequences and copy-number polymorphic insertions

Abstract

The extent of human genomic structural variation suggests that there must be portions of the genome yet to be discovered, annotated and characterized at the sequence level. We present a resource and analysis of 2,363 new insertion sequences corresponding to 720 genomic loci. We found that a substantial fraction of these sequences are either missing, fragmented or misassigned when compared to recent de novo sequence assemblies from short-read next-generation sequence data. We determined that 18–37% of these new insertions are copy-number polymorphic, including loci that show extensive population stratification among Europeans, Asians and Africans. Complete sequencing of 156 of these insertions identified new exons and conserved noncoding sequences not yet represented in the reference genome. We developed a method to accurately genotype these new insertions by mapping next-generation sequencing datasets to the breakpoint, thereby providing a means to characterize copy-number status for regions previously inaccessible to single-nucleotide polymorphism microarrays.
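
The breakpoint-based genotyping idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the value of k, the read-count cutoff and the allele-fraction thresholds are all hypothetical, and the k-mers are only checked for uniqueness between the two allele sequences rather than against the whole genome, which a real analysis would need to do.

```python
from collections import Counter

# Translation table for reverse-complementing DNA (A<->T, C<->G).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def diagnostic_kmers(ref_allele, ins_allele, k):
    """Return (ref_only, ins_only): k-mers present in exactly one allele.

    Assumption: uniqueness is tested only between the two alleles,
    not against the rest of the genome.
    """
    ref = kmers(ref_allele, k)
    ins = kmers(ins_allele, k)
    return ref - ins, ins - ref


def count_support(reads, ref_only, ins_only, k):
    """Count reads carrying a diagnostic k-mer for each allele (either strand)."""
    counts = Counter()
    for read in reads:
        found = kmers(read, k) | kmers(revcomp(read), k)
        if found & ref_only:
            counts["ref"] += 1
        if found & ins_only:
            counts["ins"] += 1
    return counts


def call_genotype(counts, min_reads=2, het_low=0.2, het_high=0.8):
    """Return the insertion copy number (0, 1 or 2), or None if coverage is low.

    The thresholds are illustrative placeholders, not values from the paper.
    """
    total = counts["ref"] + counts["ins"]
    if total < min_reads:
        return None
    frac = counts["ins"] / total
    if frac < het_low:
        return 0
    if frac > het_high:
        return 2
    return 1


# Toy example: an empty site (reference allele) and the same site with an insertion.
LEFT, INSERT, RIGHT = "ACGTTGCA", "TTTTCCCCAAAA", "GGATCCTA"
REF_ALLELE = LEFT + RIGHT
INS_ALLELE = LEFT + INSERT + RIGHT
K = 6

REF_ONLY, INS_ONLY = diagnostic_kmers(REF_ALLELE, INS_ALLELE, K)
REF_READ = "TTGCAGGATC"  # spans the empty-site junction
INS_READ = "TGCATTTTCC"  # spans the left insertion breakpoint

reads = [REF_READ] * 3 + [INS_READ] * 3
genotype = call_genotype(count_support(reads, REF_ONLY, INS_ONLY, K))
print(genotype)
```

Counting reads rather than individual k-mer hits keeps a single long read from dominating the allele fraction; other weighting schemes would be equally defensible in a sketch like this.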

Figure 1: Copy-number polymorphism of novel insertions.
Figure 2: Sequencing and genotyping insertions.
Figure 3: Insertion allele frequency distribution.
Figure 4: Annotation of conserved and functional elements.
Figure 5: Genotyping sequenced variants through unique k-mer matches.

Accession codes

Accessions

Gene Expression Omnibus

Acknowledgements

We thank C. Campbell, G. Cooper and T. Marques-Bonet for thoughtful discussion, P. Sudmant for assistance with Illumina sequence data, and members of the University of Washington and Washington University Genome Centers for assistance with data generation. J.M.K. is supported by a US National Science Foundation Graduate Research Fellowship. This work was supported by US National Institutes of Health grant HG004120 to E.E.E. E.E.E. receives funds as an Investigator of the Howard Hughes Medical Institute.

Author information

Contributions

J.M.K., N.S., F.A., A.T., R.K. and E.E.E. analyzed data. N.S., P.A., A.T., N.A.Y., P.T. and L.B. performed array CGH and copy-number analysis. F.A., M.V. and G.G. performed FISH experiments. C.A. assembled contigs. T.G., R.F., H.S.H., M.M., J.K., R.K. and R.K.W. performed clone characterization and sequencing. J.M.K., R.K., L.B. and E.E.E. designed the study. J.M.K. and E.E.E. wrote the paper with contributions from the other authors.

Corresponding author

Correspondence to Evan E. Eichler.

Ethics declarations

Competing interests

N.S., P.A., A.T., N.A.Y., P.T. and L.B. are employees of Agilent Technologies. E.E.E. is a scientific advisory board member for Pacific Biosciences.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–7, Supplementary Tables 2, 3, 5, 9, 11 and 12, and Supplementary Note (PDF 10023 kb)

Supplementary Table 1

Contig information (XLS 795 kb)

Supplementary Table 4

Summary of unassembled OEA sequences (XLS 3949 kb)

Supplementary Table 6

Assigned copy-number genotypes (XLS 151 kb)

Supplementary Table 7

Insertion allele frequencies (XLS 76 kb)

Supplementary Table 8

Novel insertions with high FST (XLS 21 kb)

Supplementary Table 10

Sequenced fosmid clones (XLS 64 kb)

Supplementary Table 13

Exons within insertion sequences (XLS 19 kb)

Supplementary Table 14

Breakpoint k-mer genotyping results (XLS 20 kb)

About this article

Cite this article

Kidd, J., Sampas, N., Antonacci, F. et al. Characterization of missing human genome sequences and copy-number polymorphic insertions. Nat Methods 7, 365–371 (2010). https://doi.org/10.1038/nmeth.1451

  • DOI: https://doi.org/10.1038/nmeth.1451
