Abstract
The extent of human genomic structural variation suggests that there must be portions of the genome yet to be discovered, annotated and characterized at the sequence level. We present a resource and analysis of 2,363 new insertion sequences corresponding to 720 genomic loci. We found that a substantial fraction of these sequences are either missing, fragmented or misassigned when compared to recent de novo sequence assemblies from short-read next-generation sequence data. We determined that 18–37% of these new insertions are copy-number polymorphic, including loci that show extensive population stratification among Europeans, Asians and Africans. Complete sequencing of 156 of these insertions identified new exons and conserved noncoding sequences not yet represented in the reference genome. We developed a method to accurately genotype these new insertions by mapping next-generation sequencing datasets to the breakpoint, thereby providing a means to characterize copy-number status for regions previously inaccessible to single-nucleotide polymorphism microarrays.
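The breakpoint-based genotyping idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of k, the `min_support` threshold, and the exact-match rule are all assumptions made for illustration. The sketch builds k-mers that span the insertion junctions and the empty (reference) junction, then counts reads carrying k-mers diagnostic for each allele.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def junction_kmers(left, right, k):
    """Return k-mers spanning the junction between `left` and `right`.

    Only the last k-1 bases of `left` and the first k-1 bases of `right`
    are joined, so every k-mer contains bases from both sides.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    return kmers(left[-(k - 1):] + right[:k - 1], k)


def genotype(reads, left_flank, insertion, right_flank, k=6, min_support=1):
    """Call the insertion copy number (0, 1 or 2) from breakpoint k-mers.

    Hypothetical simplification: exact matching, forward strand only,
    no sequencing-error model. Returns None when no read is informative.
    """
    ins = (junction_kmers(left_flank, insertion, k)
           | junction_kmers(insertion, right_flank, k))
    ref = junction_kmers(left_flank, right_flank, k)
    # Keep only k-mers that distinguish the two alleles.
    ins_only, ref_only = ins - ref, ref - ins

    ins_reads = sum(1 for r in reads if kmers(r, k) & ins_only)
    ref_reads = sum(1 for r in reads if kmers(r, k) & ref_only)

    has_ins = ins_reads >= min_support
    has_ref = ref_reads >= min_support
    if has_ins and has_ref:
        return 1      # heterozygous: both junction types observed
    if has_ins:
        return 2      # only insertion junctions observed
    if has_ref:
        return 0      # only the empty-site junction observed
    return None       # no informative reads


# Toy sequences, invented for this example.
LEFT, INS, RIGHT = "ACGTACGTTGCA", "GGGCCCAAATTTGGG", "TTAGCCATGACT"
ins_read = "CGTTGCAGGGCCC"   # spans the left flank / insertion junction
ref_read = "GTTGCATTAGCC"    # spans the empty-site junction
```

Calling `genotype([ins_read, ref_read], LEFT, INS, RIGHT)` treats the sample as heterozygous because both junction types are observed. A practical version would also need to handle reverse-complement reads, sequencing errors and k-mers that occur elsewhere in the genome.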
Acknowledgements
We thank C. Campbell, G. Cooper and T. Marques-Bonet for thoughtful discussion, P. Sudmant for assistance with Illumina sequence data, and members of the University of Washington and Washington University Genome Centers for assistance with data generation. J.M.K. is supported by a US National Science Foundation Graduate Research Fellowship. This work was supported by the US National Institutes of Health grant HG004120 to E.E.E. E.E.E. receives funds as an Investigator of the Howard Hughes Medical Institute.
Contributions
J.M.K., N.S., F.A., A.T., R.K. and E.E.E. analyzed data. N.S., P.A., A.T., N.A.Y., P.T. and L.B. performed array CGH and copy-number analysis. F.A., M.V. and G.G. performed FISH experiments. C.A. assembled contigs. T.G., R.F., H.S.H., M.M., J.K., R.K. and R.K.W. performed clone characterization and sequencing. J.M.K., R.K., L.B. and E.E.E. designed the study. J.M.K. and E.E.E. wrote the paper with contributions from the other authors.
Ethics declarations
Competing interests
N.S., P.A., A.T., N.A.Y., P.T. and L.B. are employees of Agilent Technologies. E.E.E. is a scientific advisory board member for Pacific Biosciences.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–7, Supplementary Tables 2, 3, 5, 9, 11 and 12, and Supplementary Note (PDF 10023 kb)
Supplementary Table 1
Contig information (XLS 795 kb)
Supplementary Table 4
Summary of unassembled OEA sequences (XLS 3949 kb)
Supplementary Table 6
Assigned copy-number genotypes (XLS 151 kb)
Supplementary Table 7
Insertion allele frequencies (XLS 76 kb)
Supplementary Table 8
Novel insertions with high FST (XLS 21 kb)
Supplementary Table 10
Sequenced fosmid clones (XLS 64 kb)
Supplementary Table 13
Exons within insertion sequences (XLS 19 kb)
Supplementary Table 14
Breakpoint k-mer genotyping results (XLS 20 kb)
Cite this article
Kidd, J., Sampas, N., Antonacci, F. et al. Characterization of missing human genome sequences and copy-number polymorphic insertions. Nat Methods 7, 365–371 (2010). https://doi.org/10.1038/nmeth.1451
DOI: https://doi.org/10.1038/nmeth.1451