Abstract
Short tandem repeats (STRs) are highly variable elements that play a pivotal role in multiple genetic diseases, population genetics applications, and forensic casework. However, it has proven problematic to genotype STRs from high-throughput sequencing data. Here, we describe HipSTR, a novel haplotype-based method for robustly genotyping and phasing STRs from Illumina sequencing data, and we report a genome-wide analysis and validation of de novo STR mutations. HipSTR is freely available at https://hipstr-tool.github.io/HipSTR.
Acknowledgements
Y.E. holds a Career Award at the Scientific Interface from the Burroughs Wellcome Fund. This study was supported by NIJ grant 2014-DN-BX-K089 (T.W., D.Z., A.G., M.G., and Y.E.) and a generous gift from A. Heafy and P. Heafy. Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under award number UM1HG008901. We thank Kailos Genetics for providing the 300-bp Illumina sequencing data. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Author information
Contributions
T.W., M.G., and Y.E. designed the HipSTR algorithm and subsequent analyses. T.W. and A.G. implemented the HipSTR software. T.W. and J.Y. performed the analyses. D.Z. experimentally validated the de novo mutations and analyzed the long MiSeq reads. T.W. and Y.E. wrote the manuscript.
Ethics declarations
Competing interests
Y.E. is a consultant for Arc Bio, a company interested in DNA forensics.
Integrated supplementary information
Supplementary Figure 1 Overview of the HipSTR algorithm
In step 1, an expectation maximization algorithm learns the PCR stutter model for the STR of interest. Step 2 utilizes well-anchored alignments that span the STR to identify candidate alleles. HipSTR then builds haplotypes using these alleles and the flanking sequences upstream and downstream of the repeat. In step 3, the PCR stutter model and an HMM are used to align every read to every candidate haplotype and determine the corresponding alignment likelihoods. Step 4 analyzes the likelihoods for all of a sample’s reads to determine its maximum likelihood (ML) genotype. After realigning every read to its sample’s ML genotype, step 5 uses commonly observed stutter artifacts in the alignments to identify new candidate alleles. If any are found, HipSTR returns to step 3 and repeats the process. This iterative procedure continues until no new candidate alleles are identified, at which point the ML genotypes are output to a VCF file. Red dashes and red boxed bases in the rightmost panel indicate deletions or mismatches relative to the candidate haplotype, respectively.
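Step 4 can be illustrated with a minimal sketch. It assumes that the per-read log-likelihoods from step 3 are already available and that each read is equally likely to originate from either haplotype of a diploid genotype. The function name and input layout are hypothetical and are not taken from the HipSTR source code; genotype priors, phasing information and the iterative search for new candidate alleles are omitted.

```python
import math
from itertools import combinations_with_replacement


def ml_genotype(read_logliks):
    """Return the maximum likelihood diploid genotype for one sample.

    read_logliks[r][h] is log P(read r | candidate haplotype h), as would be
    produced by aligning every read to every candidate haplotype.
    All values are assumed to be finite.
    """
    n_haps = len(read_logliks[0])
    best_gt, best_ll = None, -math.inf
    # Enumerate unordered haplotype pairs, including homozygous genotypes.
    for a, b in combinations_with_replacement(range(n_haps), 2):
        ll = 0.0
        for row in read_logliks:
            # Each read comes from haplotype a or b with probability 1/2;
            # the sum is computed in a numerically stable way.
            m = max(row[a], row[b])
            ll += m + math.log(0.5 * math.exp(row[a] - m)
                               + 0.5 * math.exp(row[b] - m))
        if ll > best_ll:
            best_gt, best_ll = (a, b), ll
    return best_gt, best_ll
```

With four reads that fit haplotype 0 and four that fit haplotype 2, this sketch selects the heterozygous genotype (0, 2); with all reads fitting haplotype 1, it selects (1, 1).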
Supplementary Figure 2 HipSTR alignment models
HipSTR uses two distinct types of alignment models to obtain alignment likelihoods. In regions flanking the STR (top), a pairwise hidden Markov model is used to account for Illumina sequencing errors. In contrast, within STR regions (bottom), HipSTR assumes that the main source of artifacts is stutter. If no stutter error occurs, the likelihood of observing a sequence of characters is given by the agreement between the bases in the read (blue) and the corresponding bases on the haplotype (green). Otherwise, HipSTR assumes that a single stutter indel occurs (orange), that its size is a multiple of the motif length M, and that it is equally likely to have arisen at each position. The alignment likelihood for these scenarios is then obtained by marginalizing over all configurations.
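The within-STR model can be sketched as follows, under simplifying assumptions that go beyond this caption: a fixed per-base error rate, a lookup table of stutter-size probabilities (which HipSTR would learn by expectation maximization), and inserted bases treated as unconstrained. The names str_likelihood, match_prob and stutter_probs are illustrative only and do not reflect HipSTR's implementation.

```python
def match_prob(read, hap, err=0.01):
    """Probability of the read bases given equally long haplotype bases."""
    p = 1.0
    for r, h in zip(read, hap):
        p *= (1.0 - err) if r == h else err / 3.0
    return p


def str_likelihood(read, hap, motif_len, stutter_probs, err=0.01):
    """Likelihood of the read's STR block given the haplotype's STR block.

    stutter_probs maps a stutter size in bp (0 = no stutter) to its
    probability. At most one stutter indel is allowed, its size must be a
    multiple of motif_len, and every start position is equally likely.
    """
    d = len(read) - len(hap)
    if d % motif_len != 0:
        return 0.0
    p_size = stutter_probs.get(d, 0.0)
    if d == 0:
        return p_size * match_prob(read, hap, err)
    k = abs(d)
    if d < 0:
        # Deletion: remove k haplotype bases starting at each position i.
        configs = [(read, hap[:i] + hap[i + k:])
                   for i in range(len(hap) - k + 1)]
    else:
        # Insertion: skip k read bases starting at each position i
        # (simplification: the inserted bases themselves are not scored).
        configs = [(read[:i] + read[i + k:], hap)
                   for i in range(len(read) - k + 1)]
    # Marginalize over all equally likely stutter positions.
    total = sum(match_prob(r, h, err) for r, h in configs)
    return p_size * total / len(configs)
```

For a perfect repeat, every stutter position produces the same sequence, so the marginal reduces to the stutter-size probability multiplied by the base agreement term.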
Supplementary Figure 3 Physical phasing of STRs onto SNP scaffolds
When provided with phased SNP haplotypes, HipSTR analyzes reads that overlap heterozygous SNPs to phase the STR genotypes (blue) onto the SNP haplotypes (red). The schematic provides a conceptual outline of how this would work for an AC repeat flanked by two heterozygous SNPs with known phase. For a quantitative description of HipSTR’s phasing methodology, please refer to the Online Methods.
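As a purely conceptual illustration of this idea, and not the quantitative method described in the Online Methods, the sketch below compares the two possible ways of attaching a heterozygous STR genotype to a pair of phased SNP haplotypes. It assumes that each read (or read pair) comes with four strictly positive probabilities: its likelihood under each STR allele and its likelihood under each SNP haplotype. The function name and input layout are hypothetical.

```python
import math


def phase_posterior(reads):
    """Posterior probability that STR allele 1 lies on SNP haplotype 1.

    reads is an iterable of tuples (a1, a2, s1, s2), where a1 and a2 are
    P(read | STR allele 1) and P(read | STR allele 2), and s1 and s2 are
    P(read | SNP haplotype 1) and P(read | SNP haplotype 2).
    A uniform prior over the two phasings is assumed.
    """
    ll_cis = 0.0    # allele 1 on haplotype 1, allele 2 on haplotype 2
    ll_trans = 0.0  # allele 2 on haplotype 1, allele 1 on haplotype 2
    for a1, a2, s1, s2 in reads:
        ll_cis += math.log(0.5 * s1 * a1 + 0.5 * s2 * a2)
        ll_trans += math.log(0.5 * s1 * a2 + 0.5 * s2 * a1)
    # Normalize in a numerically stable way.
    m = max(ll_cis, ll_trans)
    p_cis = math.exp(ll_cis - m)
    p_trans = math.exp(ll_trans - m)
    return p_cis / (p_cis + p_trans)
```

Reads that do not overlap a heterozygous SNP have equal SNP haplotype probabilities and leave the posterior at 0.5, which is why only SNP-overlapping reads are informative for phasing.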
Supplementary Figure 4 HipSTR in action
Example alignments for sample NA12878 to a particular Marshfield STR before (top) and after (bottom) processing by HipSTR. While the input alignments have many different indel sizes and mismatches with the reference genome, HipSTR’s algorithm resolves the alignments into two parsimonious STR insertions of 4 and 16 base pairs. Bases highlighted in yellow and red denote mismatches and insertions relative to the reference genome, respectively.
Supplementary Figure 5 Performance of variant callers in STR regions
The accuracy of each tool’s calls is shown as a function of sensitivity for the Marshfield STR panel. Solid and dashed lines denote tools run using default settings and settings optimized for STR genotyping, respectively. This figure provides a zoomed-out version of the comparison in Figure 1.
Supplementary Figure 6 Effect of coverage on STR genotyping accuracy
The accuracy of HipSTR and GATK-HC calls is shown as a function of sensitivity for the Marshfield STR panel. Each line denotes a tool’s performance when run on sequencing data from the Simons Genome Diversity Project downsampled to the indicated coverage. 41x was the median coverage of the original SGDP dataset.
Supplementary Figure 7 An example of STR homoplasy
The figure depicts an STR located at chr6:16429779 in the hg19 reference genome with an AGAT repeat followed by an ACAT repeat. Each subsequent row depicts the maximum likelihood alignment for reads from sample NA12878 after processing with HipSTR. While all of the reads support a 4bp deletion (red dashes), they support two different STR sequences. In the top 9 reads, one AGAT unit is deleted while the ACAT perfectly matches the reference. In the bottom 8 reads, two copies of AGAT are inserted followed by a deletion of three ACAT copies. HipSTR reports genotypes of (AGAT)8(ACAT)9 and (AGAT)11(ACAT)6 for NA12878 at this locus. Bases highlighted in yellow denote mismatches relative to the reference genome.
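The arithmetic behind this example can be checked with a short sketch. The reference structure (AGAT)9(ACAT)9 is inferred here from the caption, since deleting one AGAT unit from it yields the reported (AGAT)8(ACAT)9 allele; the helper name is illustrative.

```python
def build_allele(n_agat, n_acat):
    """Spell out a compound allele of the form (AGAT)n(ACAT)m."""
    return "AGAT" * n_agat + "ACAT" * n_acat


reference = build_allele(9, 9)   # inferred from the caption, see above
allele_1 = build_allele(8, 9)    # one AGAT unit deleted
allele_2 = build_allele(11, 6)   # two AGAT inserted, three ACAT deleted

# Both alleles are 4 bp shorter than the reference ...
length_diffs = (len(allele_1) - len(reference),
                len(allele_2) - len(reference))
# ... yet their sequences differ, so a length-only genotype would merge them.
same_length = len(allele_1) == len(allele_2)
same_sequence = allele_1 == allele_2
```

A caller that reports only allele length would genotype this sample as homozygous for a 4bp deletion, whereas a sequence-level genotype distinguishes the two alleles.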
Supplementary Figure 8 Identification of de novo mutations in the Illumina Platinum Genomes CEPH trio
The figure highlights the pedigree and analyses we used to identify de novo STR mutations. We first searched for STR loci where NA12878 (bold, arrow) had an STR allele not observed in her parents. We then validated these loci using orthogonal Illumina datasets and genotyped a small subset of them using Sanger sequencing for further validation.
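The first filtering step, finding loci where the child carries an allele that neither parent carries, can be sketched as below. Alleles are encoded as base-pair differences from the reference, and the function name is hypothetical. The example genotypes follow the captions of Supplementary Figures 9, 11 and 12; quality filtering and the orthogonal validation steps are not modeled.

```python
def novel_child_alleles(child, father, mother):
    """Return the child's alleles that are absent from both parents.

    Each genotype is a pair of allele sizes in bp relative to the reference.
    """
    parental = set(father) | set(mother)
    return [allele for allele in child if allele not in parental]


# Supplementary Figure 9: both parents match the reference,
# child carries the reference allele and a 1bp deletion.
fig9 = novel_child_alleles(child=(0, -1), father=(0, 0), mother=(0, 0))

# Supplementary Figure 11: father 0bp/3bp, mother 3bp/3bp, child 3bp/6bp.
fig11 = novel_child_alleles(child=(3, 6), father=(0, 3), mother=(3, 3))

# Supplementary Figure 12: both parents homozygous for an 8bp deletion,
# child carries a 4bp and an 8bp deletion.
fig12 = novel_child_alleles(child=(-4, -8), father=(-8, -8), mother=(-8, -8))
```

A locus is a de novo candidate when the returned list is non-empty; a child whose alleles are all present in the parents yields an empty list.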
Supplementary Figure 9 High-confidence de novo mutation detected by HipSTR
The figure depicts an STR located at chr20:45121567 in the hg19 reference genome with an A repeat (top row). Each subsequent row depicts a read’s maximum likelihood alignment after processing with HipSTR for the child (NA12878), father (NA12891) and mother (NA12892) in the trio. While the alignments strongly support that both the father and mother match the reference allele, the alignments for NA12878 support the reference allele and a 1bp deletion (red dashes). Only a subset of the alignments for NA12878 (12.5%), NA12891 (50%) and NA12892 (50%) are displayed to facilitate visualization. Bases highlighted in yellow and red denote mismatches and insertions relative to the reference genome, respectively.
Supplementary Figure 10 High-confidence de novo mutation detected by HipSTR
The figure depicts an STR located at chr19:37201975 in the hg19 reference genome with a TG repeat (top row). Each subsequent row depicts a read’s maximum likelihood alignment after processing with HipSTR for the child (NA12878), father (NA12891) and mother (NA12892) in the trio. While the alignments strongly support that both the father and mother match the reference allele, the alignments for NA12878 support both the reference allele and a 2bp insertion. Only a subset of the alignments for NA12878 (12.5%) and NA12892 (50%) are displayed to facilitate visualization. Bases highlighted in yellow and red denote mismatches and insertions relative to the reference genome, respectively.
Supplementary Figure 11 High-confidence de novo mutation detected by HipSTR
The figure depicts an STR located at chr18:27043336 in the hg19 reference genome with an ATC repeat (top row). Each subsequent row depicts a read’s maximum likelihood alignment after processing with HipSTR for the child (NA12878), father (NA12891) and mother (NA12892) in the trio. While the alignments strongly support a 0bp/3bp heterozygous insertion for the father and a 3bp homozygous insertion for the mother, the alignments for NA12878 support a heterozygous 3bp/6bp insertion. Only a subset of the alignments for NA12878 (12.5%), NA12891 (50%) and NA12892 (50%) are displayed to facilitate visualization. Bases highlighted in yellow and red denote mismatches and insertions relative to the reference genome, respectively.
Supplementary Figure 12 High-confidence de novo mutation detected by HipSTR
The figure depicts an STR located at chr1:221172558 in the hg19 reference genome with an ATCT repeat (top row). Each subsequent row depicts a read’s maximum likelihood alignment after processing with HipSTR for the child (NA12878), father (NA12891) and mother (NA12892) in the trio. While the alignments strongly support an 8bp homozygous deletion for both parents (red dashes), the alignments for NA12878 support both a 4bp and 8bp deletion. Only 25% of the alignments for NA12878 are displayed to facilitate visualization. Bases highlighted in yellow denote mismatches relative to the reference genome.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–12 and Supplementary Tables 1–6 (PDF 3371 kb)
About this article
Cite this article
Willems, T., Zielinski, D., Yuan, J. et al. Genome-wide profiling of heritable and de novo STR variations. Nat Methods 14, 590–592 (2017). https://doi.org/10.1038/nmeth.4267
DOI: https://doi.org/10.1038/nmeth.4267