Characterization of missing human genome sequences and copy-number polymorphic insertions

Kidd, Jeffrey M; Sampas, Nick; Antonacci, Francesca; Graves, Tina; Fulton, Robert; Hayden, Hillary S; Alkan, Can; Malig, Maika; Ventura, Mario; Giannuzzi, Giuliana; Kallicki, Joelle; Anderson, Paige; Tsalenko, Anya; Yamada, N Alice; Tsang, Peter; Kaul, Rajinder; Wilson, Richard K; Bruhn, Laurakay; Eichler, Evan E

doi:10.1038/nmeth.1451

Resource
Published: 18 April 2010

Nature Methods volume 7, pages 365–371 (2010)

Abstract

The extent of human genomic structural variation suggests that there must be portions of the genome yet to be discovered, annotated and characterized at the sequence level. We present a resource and analysis of 2,363 new insertion sequences corresponding to 720 genomic loci. We found that a substantial fraction of these sequences are either missing, fragmented or misassigned when compared to recent de novo sequence assemblies from short-read next-generation sequence data. We determined that 18–37% of these new insertions are copy-number polymorphic, including loci that show extensive population stratification among Europeans, Asians and Africans. Complete sequencing of 156 of these insertions identified new exons and conserved noncoding sequences not yet represented in the reference genome. We developed a method to accurately genotype these new insertions by mapping next-generation sequencing datasets to the breakpoint, thereby providing a means to characterize copy-number status for regions previously inaccessible to single-nucleotide polymorphism microarrays.
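The breakpoint-based genotyping idea can be illustrated with a minimal sketch. It assumes that each insertion is described by its two flanking sequences and the inserted sequence, that k-mers spanning a junction are diagnostic for one allele, and that a simple read-count threshold decides the call. The function names, the threshold `min_reads`, the default `k` and the genotype labels are all hypothetical and are not taken from the paper; checking that each k-mer is unique against the rest of the genome is also assumed to happen elsewhere.

```python
# Hypothetical sketch of breakpoint k-mer genotyping; not the authors' implementation.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all k-length substrings of seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def junction_kmers(left, right, k):
    """Return the k-mers that span the junction between left and right.

    Assumes both sequences are at least k - 1 bases long, so every window
    contains at least one base from each side of the junction.
    """
    return kmers(left[-(k - 1):] + right[:k - 1], k)


def diagnostic_kmers(left_flank, insertion, right_flank, k):
    """Return (reference-only, insertion-only) junction k-mer sets."""
    ref = junction_kmers(left_flank, right_flank, k)  # empty site
    ins = (junction_kmers(left_flank, insertion, k)
           | junction_kmers(insertion, right_flank, k))  # two breakpoints
    return ref - ins, ins - ref


def genotype(reads, left_flank, insertion, right_flank, k=8, min_reads=1):
    """Call a genotype from reads that match allele-specific junction k-mers.

    The presence/absence rule used here is an illustrative assumption.
    """
    ref_only, ins_only = diagnostic_kmers(left_flank, insertion, right_flank, k)
    ref_reads = ins_reads = 0
    for read in reads:
        # Reads can come from either strand, so check both orientations.
        found = kmers(read, k) | kmers(reverse_complement(read), k)
        ref_reads += bool(found & ref_only)
        ins_reads += bool(found & ins_only)
    has_ref = ref_reads >= min_reads
    has_ins = ins_reads >= min_reads
    if has_ref and has_ins:
        return "het"
    if has_ins:
        return "hom_ins"
    if has_ref:
        return "hom_ref"
    return "no_call"
```

Calling `genotype(reads, left_flank, insertion, right_flank)` with a list of read strings returns one of the four labels. A real analysis would use longer k-mers and depth-aware thresholds; the short defaults here only keep the example small.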

**Figure 1: Copy-number polymorphism of novel insertions.**

**Figure 2: Sequencing and genotyping insertions.**

**Figure 3: Insertion allele frequency distribution.**

**Figure 4: Annotation of conserved and functional elements.**

**Figure 5: Genotyping sequenced variants through unique k-mer matches.**

Accession codes

Accessions

Gene Expression Omnibus

GSE20634

Acknowledgements

We thank C. Campbell, G. Cooper and T. Marques-Bonet for thoughtful discussion, P. Sudmant for assistance with Illumina sequence data and members of the University of Washington and Washington University Genome Centers for assistance with data generation. J.M.K. is supported by a US National Science Foundation Graduate Research Fellowship. This work was supported by the US National Institutes of Health grant HG004120 to E.E.E. E.E.E. receives funds as an Investigator of the Howard Hughes Medical Institute.

Author information

Authors and Affiliations

Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington, USA
Jeffrey M Kidd, Francesca Antonacci, Hillary S Hayden, Can Alkan, Maika Malig, Rajinder Kaul & Evan E Eichler
Agilent Laboratories, Santa Clara, California, USA
Nick Sampas, Paige Anderson, Anya Tsalenko, N Alice Yamada, Peter Tsang & Laurakay Bruhn
Washington University Genome Sequencing Center, School of Medicine, St. Louis, Missouri, USA
Tina Graves, Robert Fulton, Joelle Kallicki & Richard K Wilson
Department of Genetics and Microbiology, University of Bari, Bari, Italy
Mario Ventura & Giuliana Giannuzzi
Howard Hughes Medical Institute, Seattle, Washington, USA
Evan E Eichler

Contributions

J.M.K., N.S., F.A., A.T., R.K. and E.E.E. analyzed data. N.S., P.A., A.T., N.A.Y., P.T. and L.B. performed array CGH and copy-number analysis. F.A., M.V. and G.G. performed FISH experiments. C.A. assembled contigs. T.G., R.F., H.S.H., M.M., J.K., R.K. and R.K.W. performed clone characterization and sequencing. J.M.K., R.K., L.B. and E.E.E. designed the study. J.M.K. and E.E.E. wrote the paper with contributions from the other authors.

Corresponding author

Correspondence to Evan E Eichler.

Ethics declarations

Competing interests

N.S., P.A., A.T., N.A.Y., P.T. and L.B. are employees of Agilent Technologies. E.E.E. is a scientific advisory board member for Pacific Biosciences.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–7, Supplementary Tables 2, 3, 5, 9, 11 and 12, and Supplementary Note (PDF 10023 kb)

Supplementary Table 1

Contig information (XLS 795 kb)

Supplementary Table 4

Summary of unassembled OEA sequences (XLS 3949 kb)

Supplementary Table 6

Assigned copy-number genotypes (XLS 151 kb)

Supplementary Table 7

Insertion allele frequencies (XLS 76 kb)

Supplementary Table 8

Novel insertions with high F_ST (XLS 21 kb)

Supplementary Table 10

Sequenced fosmid clones (XLS 64 kb)

Supplementary Table 13

Exons within insertion sequences (XLS 19 kb)

Supplementary Table 14

Breakpoint k-mer genotyping results (XLS 20 kb)

About this article

Cite this article

Kidd, J., Sampas, N., Antonacci, F. et al. Characterization of missing human genome sequences and copy-number polymorphic insertions. Nat Methods 7, 365–371 (2010). https://doi.org/10.1038/nmeth.1451

Received: 09 December 2009
Accepted: 19 March 2010
Published: 18 April 2010
Issue Date: May 2010
DOI: https://doi.org/10.1038/nmeth.1451
