  • Letter
Diversity in non-repetitive human sequences not found in the reference genome

Abstract

Genomes usually contain some non-repetitive sequences that are missing from the reference genome and occur only in a population subset. Such non-repetitive, non-reference (NRNR) sequences have remained largely unexplored in terms of their characterization and downstream analyses. Here we describe 3,791 breakpoint-resolved NRNR sequence variants called using PopIns from whole-genome sequence data of 15,219 Icelanders. We found that over 95% of the 244 NRNR sequences that are 200 bp or longer are present in chimpanzees, indicating that they are ancestral. Furthermore, 149 variant loci are in linkage disequilibrium (r² > 0.8) with a genome-wide association study (GWAS) catalog marker, suggesting disease relevance. Additionally, we report an association (P = 3.8 × 10⁻⁸, odds ratio (OR) = 0.92) with myocardial infarction (23,360 cases, 300,771 controls) for a 766-bp NRNR sequence variant. Our results underline the importance of including variation of all complexity levels when searching for variants that associate with disease.
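
As an illustration of the linkage disequilibrium threshold quoted in the abstract, the sketch below computes the standard r² statistic between two biallelic loci from phased haplotypes coded 0/1. It is a minimal example under stated assumptions, not the pipeline used in the study: the function name `ld_r2` and the toy haplotypes are hypothetical, and the input is assumed to be phased and free of missing data.

```python
from typing import Sequence


def ld_r2(hap_a: Sequence[int], hap_b: Sequence[int]) -> float:
    """Return r^2 between two biallelic loci given phased 0/1 haplotypes.

    hap_a[i] and hap_b[i] are the alleles carried by haplotype i at
    locus A and locus B (hypothetical toy encoding: 1 = variant allele).
    """
    if len(hap_a) != len(hap_b) or len(hap_a) == 0:
        raise ValueError("haplotype vectors must be non-empty and equal in length")

    n = len(hap_a)
    p_a = sum(hap_a) / n  # frequency of allele 1 at locus A
    p_b = sum(hap_b) / n  # frequency of allele 1 at locus B
    # frequency of haplotypes carrying allele 1 at both loci
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n

    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d * d / denom


if __name__ == "__main__":
    # Toy data: an NRNR variant and a GWAS marker on eight haplotypes.
    nrnr = [1, 1, 0, 0, 0, 0, 0, 0]
    marker = [1, 1, 0, 0, 0, 0, 0, 1]
    r2 = ld_r2(nrnr, marker)
    print(f"r^2 = {r2:.3f}; passes 0.8 threshold: {r2 > 0.8}")
```

A pair of loci would count toward the 149 reported in the abstract only if the value exceeds 0.8; how the genotypes were phased and imputed is described in the paper itself and is not reproduced here.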

Figure 1: Workflow for calling non-reference, non-repetitive sequence variants.
Figure 2: Frequency, length and examples of non-reference, non-repetitive sequence variants.
Figure 3: The 766-bp NRNR sequence variant (NRNR1361) affecting an intron of SREBF1 that associates with myocardial infarction.

Accession codes

Primary accessions

NCBI Reference Sequence

Author information

Contributions

B.K., P.M., B.V.H. and K.S. designed the experiments. B.K., P.M., A.G., S.K., D.F.G. and B.V.H. implemented the methodology and analyzed the call set. B.K., A. Helgadottir, H. Holm, P.S., D.F.G. and B.V.H. interpreted the association results. B.K., H. Helgason and G.H.H. analyzed gene expression. Aslaug Jonasdottir, Adalbjorg Jonasdottir, and A.S. performed PCR verification and Sanger sequencing. U.T. oversaw the operations of the genotyping facilities. G.T., I.O., H. Holm and U.T. were responsible for phenotype data acquisition. B.K. prepared tables and figures. B.K., H.J., A. Helgason and B.V.H. wrote the manuscript. All authors reviewed and approved the final manuscript. K.S. supervised the study.

Corresponding authors

Correspondence to Bjarni V Halldorsson or Kari Stefansson.

Ethics declarations

Competing interests

B.K., A. Helgadottir, P.M., H.J., H. Helgason, Adalbjorg Jonasdottir, Aslaug Jonasdottir, A.S., A.G., G.H.H., S.K., H. Holm, P.S., U.T., A. Helgason, D.F.G., B.V.H. and K.S. are all employees of deCODE Genetics/Amgen, Inc.

Integrated supplementary information

Supplementary Figure 1 Primer pairs designed for the five categories of NRNR sequence variants for validation by Sanger sequencing.

For the categories “INS > 200” and “DEL > INS”, two or three primer pairs were designed, including at least one for each allele. For “INS < 200”, only a single primer pair was designed, which may amplify both alleles. For “Different contig”, three primer pairs were always designed, and for “Singleton”, two.

Supplementary information

Supplementary Text and Figures

Supplementary Figure 1, Supplementary Tables 2, 7 and 8, and Supplementary Note (PDF 1950 kb)

Supplementary Data 1: NRNR sequences anchored by imputed NRNR markers.

Sequences are given in FASTA format. (TXT 2129 kb)

Supplementary Data 2: NRNR sequences anchored by fixed NRNR markers.

Fixed are those markers that were predicted to have 100% frequency in Iceland. Sequences are given in FASTA format. (TXT 288 kb)
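
Because both supplementary data files are FASTA text, a reader might load them along the following lines. This is a minimal sketch, assuming ordinary FASTA with `>` header lines and no particular header layout; the function names `read_fasta` and `long_sequences` and the toy records are hypothetical, and the 200-bp cutoff simply mirrors the length threshold mentioned in the abstract.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {record name: sequence}.

    The record name is taken as the first whitespace-delimited token
    after '>' (an assumption about the header layout).
    """
    records: Dict[str, str] = {}
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)  # flush previous record
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence may span several lines
    if name is not None:
        records[name] = "".join(chunks)  # flush last record
    return records


def long_sequences(records: Dict[str, str], min_len: int = 200) -> Dict[str, str]:
    """Keep only records whose sequence has at least min_len bases."""
    return {n: s for n, s in records.items() if len(s) >= min_len}


if __name__ == "__main__":
    # Toy input; a real file could be passed as open(path) instead.
    toy = [">seqA example", "ACGT", "AC", ">seqB", "A" * 250]
    recs = read_fasta(toy)
    print({name: len(seq) for name, seq in recs.items()})
    print(list(long_sequences(recs)))
```

Passing an open file handle in place of the toy list works because the parser only iterates over lines.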

Supplementary Table 1

List of imputed NRNR markers. (XLSX 623 kb)

Supplementary Table 3

Details of Sanger sequencing validation experiments. (XLSX 65 kb)

Supplementary Table 4

List of fixed NRNR markers. (XLSX 48 kb)

Supplementary Table 5

Overlap of NRNR markers with known variants and sequences. (XLSX 358 kb)

Supplementary Table 6

Correlation with the GWAS catalog. (XLSX 84 kb)

Supplementary Table 9

Conversion to GenBank sequences. (XLSX 163 kb)

About this article

Cite this article

Kehr, B., Helgadottir, A., Melsted, P. et al. Diversity in non-repetitive human sequences not found in the reference genome. Nat Genet 49, 588–593 (2017). https://doi.org/10.1038/ng.3801

  • DOI: https://doi.org/10.1038/ng.3801
