Abstract
Long-read sequencing (LRS) promises to improve the characterization of structural variants (SVs). We generated LRS data from 3,622 Icelanders and identified a median of 22,636 SVs per individual (a median of 13,353 insertions and 9,474 deletions). We discovered a set of 133,886 reliably genotyped SV alleles and imputed them into 166,281 individuals to explore their effects on diseases and other traits. We discovered an association of a rare deletion in PCSK9 with lower low-density lipoprotein (LDL) cholesterol levels, compared to the population average. We also discovered an association of a multiallelic SV in ACAN with height; we found 11 alleles that differed in the number of a 57-bp-motif repeat and observed a linear relationship between the number of repeats carried and height. These results show that SVs can be accurately characterized at the population scale using LRS data in a genome-wide non-targeted approach and demonstrate how SVs impact phenotypes.
Data availability
Access to these data is controlled; the sequence data cannot be made publicly available because Icelandic law and the regulations of the Icelandic Data Protection Authority prohibit the release of individual-level and personally identifying data. Data access can be granted only at the facilities of deCODE genetics in Iceland, subject to Icelandic law regarding data usage. Anyone wishing to gain access to the data should contact K.S. (kstefans@decode.is). Icelandic law allows for unimpeded sharing of summary-level data. The summary-level data released with this study consist of Supplementary Data 1–5, as described below, together with the VCF and index files for the high-confidence SV alleles at https://github.com/DecodeGenetics/LRS_SV_sets.
Code availability
Code is available as follows: SViper, modified, used in this study (https://github.com/DecodeGenetics/SViper/tree/cornercases); SViper, original repository (https://github.com/smehringer/SViper); Scrappie, modified, used in this study (https://github.com/DecodeGenetics/scrappie/tree/v1.3.0.events); Scrappie, original repository (https://github.com/nanoporetech/scrappie); SquiggleSVFilter (https://github.com/DecodeGenetics/nanopolish/tree/squigglesv); sample execution of SquiggleSVFilter with input and expected output data (https://github.com/DecodeGenetics/SquiggleSV_samplerun); sv-merger, which forms SV cliques using the Cluster Affinity Search Technique algorithm (https://github.com/DecodeGenetics/sv-merger); LRcaller (https://github.com/DecodeGenetics/LRcaller).
Acknowledgements
We thank J. Simpson, our colleagues from deCODE genetics/Amgen Inc. and employees of ONT for their help and advice. We also thank all research participants who provided a biological sample to deCODE genetics.
Author information
Contributions
D.B. implemented software, with additional software implemented by H.I., H.P.E., S.K., S.M., G.H. and B.V.H. D.B. and B.V.H. wrote the paper with input from H.I., A.O., H.P.E., E.B., H.J., B.A.A., S.K., M.T.H., S.A.G., R.P.K., G.H., G.P., O.A.S., A.H., U.T., H.H., D.F.G., P.S., O.T.M. and K.S. H.I. implemented the analysis pipelines, with input from D.B., S.K., S.A.G., S.T.S., G.M. and B.V.H. D.N.M. and O.T.M. performed ONT sequencing. Aslaug Jonasdottir and Adalbjorg Jonasdottir performed PCR validation experiments. G.E., I.O. and O.S. acquired LDL measurements. H.H. and B.T. acquired heart tissues. B.V.H. and K.S. conceived and supervised the study. All authors approved the final version of the manuscript.
Ethics declarations
Competing interests
D.B., H.I., A.O., H.P.E., E.B., H.J., B.A.A., S.K., M.T.H., S.A.G., D.N.M., Aslaug Jonasdottir, Adalbjorg Jonasdottir, R.P.K., S.T.S., G.H., G.P., O.A.S., G.M., A.H., U.T., H.H., D.F.G., P.S., O.T.M., B.V.H. and K.S. are employees of deCODE genetics/Amgen. The remaining authors declare no competing interests.
Additional information
Peer review information Nature Genetics thanks Mark Chaisson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Oxford Nanopore Technologies (ONT) long-read sequencing statistics.
a, N50 length per flowcell (N = 4,757 flowcells) prior to GRCh38 alignment. b,c,d, Aligned coverage, alignment percentage, and error rates stratified by type, per individual (N = 3,622 individuals). Statistics are computed over sequenced reads longer than 3,000 bp. In panel d, box limits indicate upper and lower quartiles, centre line indicates median, and whiskers indicate ±1.5 times the interquartile range.
Extended Data Fig. 2 SquiggleSVFilter overview.
Given a candidate structural variant (SV) and an SV-supporting read, SquiggleSVFilter first identifies the subread of the ONT basecalled read overlapping the SV, using the reference alignment BAM file. Next, it finds the squiggle slice of the identified subsequence using the event table. For both the left and right flanks around the variant, it determines the reference and alternative sequences given the candidate variant and computes their raw-data-versus-sequence log-likelihood scores with the squiggle slice. A sufficiently high log-likelihood score difference in favour of the alternative allele marks the read as an SV-supporting read.
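The final decision step described in this caption can be sketched in Python as follows. This is not SquiggleSVFilter's implementation: the log-likelihood scores are taken as plain numbers that some upstream scoring step is assumed to have produced. The names `FlankScores`, `supports_sv`, `min_margin` and `require_both` are hypothetical. The caption states neither the threshold value nor how the two flanks are combined, so both are exposed as parameters with arbitrary defaults.

```python
from dataclasses import dataclass


@dataclass
class FlankScores:
    """Log-likelihood scores of one flank's squiggle slice (hypothetical container)."""
    ref_loglik: float  # score of the reference sequence
    alt_loglik: float  # score of the alternative (SV) sequence

    @property
    def alt_margin(self) -> float:
        # Positive when the raw signal fits the alternative allele better.
        return self.alt_loglik - self.ref_loglik


def supports_sv(left: FlankScores, right: FlankScores,
                min_margin: float = 10.0, require_both: bool = False) -> bool:
    """Mark a read as SV-supporting when the alternative allele scores
    sufficiently higher than the reference.

    min_margin and require_both are assumptions; the caption does not
    specify the threshold or whether one or both flanks must pass.
    """
    passed = [flank.alt_margin >= min_margin for flank in (left, right)]
    return all(passed) if require_both else any(passed)


if __name__ == "__main__":
    # Toy scores (not real data): only the left flank favours the SV.
    left = FlankScores(ref_loglik=-120.0, alt_loglik=-100.0)
    right = FlankScores(ref_loglik=-90.0, alt_loglik=-95.0)
    print(supports_sv(left, right))                     # prints True
    print(supports_sv(left, right, require_both=True))  # prints False
```

Keeping the flank-combination rule as an explicit parameter makes the sketch usable under either reading of the caption.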
Extended Data Fig. 3 Allele frequency distribution of SVs at low frequency.
SV alleles with frequencies between 0.1% and 5% are shown in bins of width 0.01%.
Extended Data Fig. 4 Length and modulo distributions of structural variants (SVs) that are contained within exons.
a, Length distribution of SVs with lengths between 50 and 100 bp (N = 224 markers). Stars denote lengths divisible by 3. b, Modulo distribution of SV lengths across length intervals (N = 549).
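The tabulation behind panel b can be sketched in Python as follows. The sketch assumes the modulus is 3, in line with the lengths divisible by 3 highlighted in panel a, and uses made-up interval boundaries because the caption does not list them. The function name `modulo_distribution` and the toy lengths are illustrative only.

```python
from collections import Counter


def modulo_distribution(sv_lengths, intervals, modulus=3):
    """Count SV lengths by remainder (length % modulus) within each interval.

    intervals is a list of inclusive (low, high) length bounds in bp.
    Returns {(low, high): [count for remainder 0, 1, ..., modulus - 1]}.
    The modulus of 3 and the intervals are assumptions, not values
    taken from the figure.
    """
    result = {}
    for low, high in intervals:
        counts = Counter(
            length % modulus for length in sv_lengths if low <= length <= high
        )
        result[(low, high)] = [counts.get(r, 0) for r in range(modulus)]
    return result


if __name__ == "__main__":
    # Toy SV lengths in bp (not real data).
    lengths = [50, 51, 54, 60, 100, 150, 151]
    table = modulo_distribution(lengths, [(50, 100), (101, 1000)])
    for interval, counts in table.items():
        print(interval, counts)
    # (50, 100) [3, 1, 1]
    # (101, 1000) [1, 1, 0]
```

An excess of counts at remainder 0 within exons is the pattern that the stars in panel a draw attention to, since such length changes preserve the reading frame.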
Supplementary information
Supplementary Information
Supplementary Methods and Figs. 1–4
Supplementary Tables
Supplementary Tables 1–6
Supplementary Data 1
Sequencing-related information of 4,757 flow cells from 3,622 individuals.
Supplementary Data 2
Summary-level data of high-confidence SVs.
Supplementary Data 3
Primer sequences and results from PCR validation.
Supplementary Data 4
5,238 SVs in strong LD with GWAS catalog variants and related data.
Supplementary Data 5
List of genes with at least one homozygous carrier of a rare high-impact SV allele in our study.
About this article
Cite this article
Beyter, D., Ingimundardottir, H., Oddsson, A. et al. Long-read sequencing of 3,622 Icelanders provides insight into the role of structural variants in human diseases and other traits. Nat Genet 53, 779–786 (2021). https://doi.org/10.1038/s41588-021-00865-4