Article

Using population admixture to help complete maps of the human genome

Nature Genetics volume 45, pages 406–414 (2013)

Abstract

Tens of millions of base pairs of euchromatic human genome sequence, including many protein-coding genes, have no known location in the human genome. We describe an approach for localizing the human genome's missing pieces using the patterns of genome sequence variation created by population admixture. We mapped the locations of 70 scaffolds spanning 4 million base pairs of the human genome's unplaced euchromatic sequence, including more than a dozen protein-coding genes, and identified 8 new large interchromosomal segmental duplications. We find that most of these sequences are hidden in the genome's heterochromatin, particularly its pericentromeric regions. Many cryptic, pericentromeric genes are expressed at the RNA level and have been maintained intact for millions of years while their expression patterns diverged from those of paralogous genes elsewhere in the genome. We describe how knowledge of the locations of these sequences can inform disease association and genome biology studies.
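
As an illustration only, and not the authors' actual pipeline, the basic intuition behind admixture-based localization can be sketched in a few lines of Python. The assumption is that, in an admixed population, the genotype of a variant on an unplaced scaffold should track local ancestry most closely at the scaffold's true genomic location. The function names, window labels and toy data below are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no mapping signal
    return sxy / math.sqrt(sxx * syy)


def best_window(genotypes, local_ancestry):
    """Return the window whose local ancestry best tracks the genotypes.

    genotypes: per-individual allele counts (0, 1 or 2) for one variant
        on an unplaced scaffold.
    local_ancestry: hypothetical mapping from a genomic window label to
        per-individual counts (0, 1 or 2) of chromosomes of one ancestry.
    """
    scores = {w: pearson(genotypes, anc) for w, anc in local_ancestry.items()}
    # Allele coding is arbitrary, so rank windows by absolute correlation.
    best = max(scores, key=lambda w: abs(scores[w]))
    return best, scores


# Toy data for six individuals (invented for illustration).
genotypes = [0, 1, 2, 2, 0, 1]
local_ancestry = {
    "chr1:w1": [1, 1, 0, 2, 1, 0],
    "chr9:w7": [0, 1, 2, 2, 0, 1],
    "chr21:w3": [2, 0, 1, 1, 2, 1],
}
best, scores = best_window(genotypes, local_ancestry)
print(best, round(scores[best], 3))
```

In this toy input the variant's genotypes match the ancestry counts in window "chr9:w7" exactly, so that window is reported as the candidate location. A real analysis would need inferred local ancestry, many individuals, and a proper significance threshold rather than a simple maximum.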

Acknowledgements

This study was supported by grants RC1 GM091332-01 (S.A.M. and J.G.W.), R01 HG006855 (S.A.M.) and R01DK54931 (G.G. and M.R.P.) from the US National Institutes of Health and by a Smith Family Foundation Award for Excellence in Biomedical Research (S.A.M.).

The Jackson Heart Study is supported and conducted in collaboration with Jackson State University (N01-HC-95170), University of Mississippi Medical Center (N01-HC-95171) and Tougaloo College (N01-HC-95172) under contracts from the National Heart, Lung, and Blood Institute (NHLBI) and the National Institute for Minority Health and Health Disparities (NIMHD), with additional support from the National Institute of Biomedical Imaging and Bioengineering (NIBIB).

The Atherosclerosis Risk in Communities Study is carried out as a collaborative study supported by NHLBI contracts (HHSN268201100005C, HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C, HHSN268201100010C, HHSN268201100011C and HHSN268201100012C).

The Coronary Artery Risk Development in Young Adults Study (CARDIA) is conducted and supported by the NHLBI in collaboration with the University of Alabama at Birmingham (N01-HC95095 and N01-HC48047), the University of Minnesota (N01-HC48048), Northwestern University (N01-HC48049) and the Kaiser Foundation Research Institute (N01-HC48050).

MESA, MESA Family and the MESA SHARe project are conducted and supported by the NHLBI in collaboration with the MESA investigators. Support for MESA is provided by contracts N01-HC-95159 through N01-HC-95169 and RR-024156. Funding for MESA Family is provided by grants R01-HL-071051, R01-HL-071205, R01-HL-071250, R01-HL-071251, R01-HL-071252, R01-HL-071258 and R01-HL-071259. MESA Air is funded by the US Environmental Protection Agency (EPA) Science to Achieve Results (STAR) Program Grant RD831697. Funding for genotyping was provided by NHLBI contracts N02-HL-6-4278 and N01-HC-65226.

This manuscript does not necessarily reflect the opinions or views of ARIC, CARDIA, JHS, MESA or the NHLBI.

Author information

Affiliations

  1. Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA.

    • Giulio Genovese
    • Robert E Handsaker
    • Heng Li
    • Kimberly Chambert
    • Alkes L Price
    • Cynthia C Morton
    • Martin R Pollak
    • Steven A McCarroll
  2. Department of Genetics, Harvard Medical School, Boston, Massachusetts, USA.

    • Giulio Genovese
    • Robert E Handsaker
    • Heng Li
    • Nicolas Altemose
    • David Reich
    • Steven A McCarroll
  3. Division of Nephrology, Department of Medicine, Beth Israel Deaconess Medical Center and Harvard Medical School, Boston, Massachusetts, USA.

    • Giulio Genovese
    • Martin R Pollak
  4. Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA.

    • Giulio Genovese
    • Robert E Handsaker
    • Kimberly Chambert
    • Steven A McCarroll
  5. Department of Obstetrics, Gynecology and Reproductive Biology, Brigham and Women's Hospital, Boston, Massachusetts, USA.

    • Amelia M Lindgren
    • Cynthia C Morton
  6. Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, USA.

    • Bogdan Pasaniuc
    • Alkes L Price
  7. Department of Pathology, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts, USA.

    • Cynthia C Morton
  8. Department of Physiology and Biophysics, University of Mississippi Medical Center, Jackson, Mississippi, USA.

    • James G Wilson

Authors

  1. Giulio Genovese

  2. Robert E Handsaker

  3. Heng Li

  4. Nicolas Altemose

  5. Amelia M Lindgren

  6. Kimberly Chambert

  7. Bogdan Pasaniuc

  8. Alkes L Price

  9. David Reich

  10. Cynthia C Morton

  11. Martin R Pollak

  12. James G Wilson

  13. Steven A McCarroll

Contributions

G.G. and S.A.M. conceived the project, designed the analyses and wrote the manuscript. G.G. performed the analysis of the CARe, ICDB, JHS and BodyMap 2.0 data sets. R.E.H. performed the sequence read depth analysis of selected regions. H.L. performed the alignments of HuRef scaffolds and GenBank clones. N.A. contributed the analysis of the HuRef unplaced scaffolds. A.M.L. performed the FISH experiments. K.C. organized and contributed to the design of the Sequenom experiment. B.P., A.L.P. and D.R. provided advice for the local ancestry inference. C.C.M. participated in the interpretation of the FISH experiments. M.R.P. participated in planning discussions for the linkage analysis. J.G.W. participated in planning discussions, coordinated interactions with JHS and edited the manuscript.

Competing interests

The authors declare no competing financial interests.

Corresponding authors

Correspondence to Giulio Genovese or Steven A McCarroll.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Note, Supplementary Tables 1–13 and Supplementary Figures 1–24

About this article

DOI

https://doi.org/10.1038/ng.2565